Update to observability docs for OTEL (#2876)
* otel doc

Signed-off-by: msfussell <[email protected]>

* Update daprdocs/content/en/developing-applications/building-blocks/observability/tracing-overview.md

Co-authored-by: Hannah Hunter <[email protected]>
Signed-off-by: Mark Fussell <[email protected]>

* Update daprdocs/content/en/developing-applications/building-blocks/observability/tracing-overview.md

Co-authored-by: Hannah Hunter <[email protected]>
Signed-off-by: Mark Fussell <[email protected]>

* Update daprdocs/content/en/developing-applications/building-blocks/observability/tracing-overview.md

Co-authored-by: Hannah Hunter <[email protected]>
Signed-off-by: Mark Fussell <[email protected]>

* Update daprdocs/content/en/developing-applications/building-blocks/observability/tracing-overview.md

Co-authored-by: Hannah Hunter <[email protected]>
Signed-off-by: Mark Fussell <[email protected]>

* Update daprdocs/content/en/developing-applications/building-blocks/observability/tracing-overview.md

Co-authored-by: Hannah Hunter <[email protected]>
Signed-off-by: Mark Fussell <[email protected]>

* Update daprdocs/content/en/operations/monitoring/metrics/metrics-overview.md

Co-authored-by: Hannah Hunter <[email protected]>
Signed-off-by: Mark Fussell <[email protected]>

* Update daprdocs/content/en/operations/monitoring/tracing/otel-collector/_index.md

Co-authored-by: Hannah Hunter <[email protected]>
Signed-off-by: Mark Fussell <[email protected]>

* Update daprdocs/content/en/operations/monitoring/tracing/otel-collector/open-telemetry-collector.md

Co-authored-by: Hannah Hunter <[email protected]>
Signed-off-by: Mark Fussell <[email protected]>

* Update daprdocs/content/en/operations/monitoring/tracing/otel-collector/open-telemetry-collector.md

Co-authored-by: Hannah Hunter <[email protected]>
Signed-off-by: Mark Fussell <[email protected]>

* Update daprdocs/content/en/operations/monitoring/tracing/setup-tracing.md

Co-authored-by: Yaron Schneider <[email protected]>
Signed-off-by: Mark Fussell <[email protected]>

* Fixed URL address

* Update daprdocs/content/en/developing-applications/building-blocks/observability/tracing-overview.md

Co-authored-by: Hannah Hunter <[email protected]>
Signed-off-by: Mark Fussell <[email protected]>

* Update daprdocs/content/en/developing-applications/building-blocks/observability/tracing-overview.md

Co-authored-by: Hannah Hunter <[email protected]>
Signed-off-by: Mark Fussell <[email protected]>

* Update daprdocs/content/en/developing-applications/building-blocks/observability/tracing-overview.md

Co-authored-by: Hannah Hunter <[email protected]>
Signed-off-by: Mark Fussell <[email protected]>

* Update daprdocs/content/en/developing-applications/building-blocks/observability/tracing-overview.md

Co-authored-by: Hannah Hunter <[email protected]>
Signed-off-by: Mark Fussell <[email protected]>

* Update daprdocs/content/en/developing-applications/building-blocks/observability/tracing-overview.md

Co-authored-by: Hannah Hunter <[email protected]>
Signed-off-by: Mark Fussell <[email protected]>

* Update daprdocs/content/en/developing-applications/building-blocks/observability/tracing-overview.md

Co-authored-by: Hannah Hunter <[email protected]>
Signed-off-by: Mark Fussell <[email protected]>

* Apply suggestions from code review

Co-authored-by: Hannah Hunter <[email protected]>
Signed-off-by: Mark Fussell <[email protected]>

* Update daprdocs/content/en/operations/monitoring/metrics/metrics-overview.md

Co-authored-by: Hannah Hunter <[email protected]>
Signed-off-by: Mark Fussell <[email protected]>

Signed-off-by: msfussell <[email protected]>
Signed-off-by: Mark Fussell <[email protected]>
Co-authored-by: Yaron Schneider <[email protected]>
Co-authored-by: Hannah Hunter <[email protected]>
3 people authored Oct 12, 2022
1 parent 8f08e68 commit 4d860db
Showing 29 changed files with 219 additions and 598 deletions.
9 changes: 8 additions & 1 deletion .gitignore
Expand Up @@ -6,4 +6,11 @@ daprdocs/public
daprdocs/resources/_gen
.venv/
.hugo_build.lock
.dccache
.dccache
.DS_Store
daprdocs/.DS_Store
daprdocs/content/.DS_Store
daprdocs/content/en/.DS_Store
daprdocs/resources/.DS_Store
daprdocs/static/.DS_Store
daprdocs/static/presentations/.DS_Store
29 changes: 12 additions & 17 deletions daprdocs/content/en/concepts/observability-concept.md
Expand Up @@ -4,41 +4,36 @@ title: "Observability"
linkTitle: "Observability"
weight: 500
description: >
Monitor applications through tracing, metrics, logs and health
Observe applications through tracing, metrics, logs and health
---

When building an application, understanding how the system is behaving is an important part of operating it - this includes having the ability to observe the internal calls of an application, gauging its performance and becoming aware of problems as soon as they occur. This is challenging for any system, but even more so for a distributed system comprised of multiple microservices where a flow, made of several calls, may start in one microservices but continue in another. Observability is critical in production environments, but also useful during development to understand bottlenecks, improve performance and perform basic debugging across the span of microservices.
When building an application, understanding how the system is behaving is an important part of operating it - this includes having the ability to observe the internal calls of an application, gauging its performance and becoming aware of problems as soon as they occur. This is challenging for any system, but even more so for a distributed system comprised of multiple microservices where a flow, made of several calls, may start in one microservice but continue in another. Observability is critical in production environments, but also useful during development to understand bottlenecks, improve performance and perform basic debugging across the span of microservices.

While some data points about an application can be gathered from the underlying infrastructure (e.g. memory consumption, CPU usage), other meaningful information must be collected from an "application-aware" layer - one that can show how an important series of calls is executed across microservices. This usually means a developer must add some code to instrument an application for this purpose. Often, instrumentation code is simply meant to send collected data such as traces and metrics to an external monitoring tool or service that can help store, visualize and analyze all this information.
While some data points about an application can be gathered from the underlying infrastructure (for example memory consumption, CPU usage), other meaningful information must be collected from an "application-aware" layer - one that can show how an important series of calls is executed across microservices. This usually means a developer must add some code to instrument an application for this purpose. Often, instrumentation code is simply meant to send collected data such as traces and metrics to observability tools or services that can help store, visualize and analyze all this information.

Having to maintain this code, which is not part of the core logic of the application, is another burden on the developer, sometimes requiring understanding the monitoring tools' APIs, using additional SDKs etc. This instrumentation may also add to the portability challenges of an application, which may require different instrumentation depending on where the application is deployed. For example, different cloud providers offer different monitoring solutions and an on-prem deployment might require an on-prem solution.
Having to maintain this code, which is not part of the core logic of the application, is a burden on the developer, sometimes requiring understanding the observability tools' APIs, using additional SDKs etc. This instrumentation may also add to the portability challenges of an application, which may require different instrumentation depending on where the application is deployed. For example, different cloud providers offer different observability tools and an on-prem deployment might require an on-prem solution.

## Observability for your application with Dapr
When building an application which leverages Dapr building blocks to perform service-to-service calls and pub/sub messaging, Dapr offers an advantage with respect to [distributed tracing]({{<ref tracing>}}). Because this inter-service communication flows through the Dapr sidecar, the sidecar is in a unique position to offload the burden of application-level instrumentation.
When building an application which leverages Dapr API building blocks to perform service-to-service calls and pub/sub messaging, Dapr offers an advantage with respect to [distributed tracing]({{<ref tracing>}}). Because this inter-service communication flows through the Dapr sidecar, the sidecar is in a unique position to offload the burden of application-level instrumentation.

### Distributed tracing
Dapr can be [configured to emit tracing data]({{<ref setup-tracing.md>}}), and because Dapr does so using widely adopted protocols such as the [Zipkin](https://zipkin.io) protocol, it can be easily integrated with multiple [monitoring backends]({{<ref supported-tracing-backends>}}).
Dapr can be [configured to emit tracing data]({{<ref setup-tracing.md>}}), and because Dapr does so using the widely adopted protocols of [Open Telemetry (OTEL)](https://opentelemetry.io/) and [Zipkin](https://zipkin.io), it can be easily integrated with multiple observability tools.

<img src="/images/observability-tracing.png" width=1000 alt="Distributed tracing with Dapr">

### OpenTelemetry collector
Dapr can also be configured to work with the [OpenTelemetry Collector]({{<ref open-telemetry-collector>}}) which offers even more compatibility with external monitoring tools.
### Automatic tracing context generation
Dapr uses the [W3C tracing]({{<ref w3c-tracing-overview>}}) specification for tracing context, included as part of Open Telemetry (OTEL), to generate and propagate the context header for the application or propagate user-provided context headers. This means that you get tracing by default with Dapr.

<img src="/images/observability-opentelemetry-collector.png" width=1000 alt="Distributed tracing via OpenTelemetry collector">

### Tracing context
Dapr uses [W3C tracing]({{<ref w3c-tracing>}}) specification for tracing context and can generate and propagate the context header itself or propagate user-provided context headers.

## Observability for the Dapr sidecar and system services
As for other parts of your system, you will want to be able to observe Dapr itself and collect metrics and logs emitted by the Dapr sidecar that runs along each microservice, as well as the Dapr-related services in your environment such as the control plane services that are deployed for a Dapr-enabled Kubernetes cluster.
## Observability for the Dapr sidecar and control plane
You also want to be able to observe Dapr itself by collecting the logs and metrics (on performance, throughput and latency) emitted by the Dapr sidecar and the Dapr control plane services. Dapr sidecars have a health endpoint that can be probed to indicate their health status.

<img src="/images/observability-sidecar.png" width=1000 alt="Dapr sidecar metrics, logs and health checks">

### Logging
Dapr generates [logs]({{<ref "logs.md">}}) to provide visibility into sidecar operation and to help users identify issues and perform debugging. Log events contain warning, error, info, and debug messages produced by Dapr system services. Dapr can also be configured to send logs to collectors such as [Fluentd]({{< ref fluentd.md >}}) and [Azure Monitor]({{< ref azure-monitor.md >}}) so they can be easily searched, analyzed and provide insights.
Dapr generates [logs]({{<ref "logs.md">}}) to provide visibility into sidecar operation and to help users identify issues and perform debugging. Log events contain warning, error, info, and debug messages produced by Dapr system services. Dapr can also be configured to send logs to collectors such as [Fluentd]({{< ref fluentd.md >}}), [Azure Monitor]({{< ref azure-monitor.md >}}), and other observability tools, so they can be searched and analyzed to provide insights.

### Metrics
Metrics are the series of measured values and counts that are collected and stored over time. [Dapr metrics]({{<ref "metrics">}}) provide monitoring capabilities to understand the behavior of the Dapr sidecar and system services. For example, the metrics between a Dapr sidecar and the user application show call latency, traffic failures, error rates of requests, etc. Dapr [system services metrics](https://github.com/dapr/dapr/blob/master/docs/development/dapr-metrics.md) show sidecar injection failures and the health of system services, including CPU usage, number of actor placements made, etc.
Metrics are the series of measured values and counts that are collected and stored over time. [Dapr metrics]({{<ref "metrics">}}) provide monitoring capabilities to understand the behavior of the Dapr sidecar and control plane. For example, the metrics between a Dapr sidecar and the user application show call latency, traffic failures, error rates of requests, etc. Dapr [control plane metrics](https://github.com/dapr/dapr/blob/master/docs/development/dapr-metrics.md) show sidecar injection failures and the health of control plane services, including CPU usage, number of actor placements made, etc.

### Health checks
The Dapr sidecar exposes an HTTP endpoint for [health checks]({{<ref sidecar-health.md>}}). With this API, user code or hosting environments can probe the Dapr sidecar to determine its status and identify issues with sidecar readiness.
Expand Up @@ -3,53 +3,113 @@ type: docs
title: "Distributed tracing"
linkTitle: "Distributed tracing"
weight: 1000
description: "Use Dapr tracing to get visibility for distributed application"
description: "Use tracing to get visibility into your application"
---

Dapr uses the Zipkin protocol for distributed traces and metrics collection. Due to the ubiquity of the Zipkin protocol, many backends are supported out of the box, for example [Stackdriver](https://cloud.google.com/stackdriver), [Zipkin](https://zipkin.io), [New Relic](https://newrelic.com) and others. Combined with the OpenTelemetry Collector, Dapr can export traces to many other backends including but not limited to [Azure Monitor](https://azure.microsoft.com/services/monitor/), [Datadog](https://www.datadoghq.com), Instana, [Jaeger](https://www.jaegertracing.io/), and [SignalFX](https://www.signalfx.com/).
Dapr uses the Open Telemetry (OTEL) and Zipkin protocols for distributed traces. OTEL is the industry standard and is the recommended trace protocol to use.

<img src="/images/tracing.png" width=600>
Most observability tools support OTEL, including [Google Cloud Operations](https://cloud.google.com/products/operations), [New Relic](https://newrelic.com), [Azure Monitor](https://azure.microsoft.com/services/monitor/), [Datadog](https://www.datadoghq.com), Instana, [Jaeger](https://www.jaegertracing.io/), and [SignalFX](https://www.signalfx.com/).

## Tracing design
## Scenarios
Tracing is used with the service invocation and pub/sub APIs. You can flow trace context between services that use these APIs.

Dapr adds a HTTP/gRPC middleware to the Dapr sidecar. The middleware intercepts all Dapr and application traffic and automatically injects correlation IDs to trace distributed transactions. This design has several benefits:
There are two scenarios for how tracing is used:
1. Dapr generates the trace context and you propagate the trace context to another service.
2. You generate the trace context and Dapr propagates the trace context to a service.

* No need for code instrumentation. All traffic is automatically traced with configurable tracing levels.
* Consistent tracing behavior across microservices. Tracing is configured and managed on Dapr sidecar so that it remains consistent across services made by different teams and potentially written in different programming languages.
* Configurable and extensible. By leveraging the Zipkin API and the OpenTelemetry Collector, Dapr tracing can be configured to work with popular tracing backends, including custom backends a customer may have.
* You can define and enable multiple exporters at the same time.
### Propagating sequential service calls
Dapr takes care of creating the trace headers. However, when there are more than two services, you're responsible for propagating the trace headers between them. Let's go through the scenarios with examples:

## W3C Correlation ID
1. Single service invocation call (`service A -> service B`)

Dapr uses the standard W3C Trace Context headers. For HTTP requests, Dapr uses `traceparent` header. For gRPC requests, Dapr uses `grpc-trace-bin` header. When a request arrives without a trace ID, Dapr creates a new one. Otherwise, it passes the trace ID along the call chain.
Dapr generates the trace headers in service A, which are then propagated from service A to service B. No further propagation is needed.

Read [W3C distributed tracing]({{< ref w3c-tracing >}}) for more background on W3C Trace Context.
2. Multiple sequential service invocation calls (`service A -> service B -> service C`)

## Configuration
Dapr generates the trace headers at the beginning of the request in service A, which are then propagated to service B. You are now responsible for taking the headers and propagating them to service C, since this is specific to your application.

`service A -> service B -> propagate trace headers to -> service C` and so on to further Dapr-enabled services.

Dapr uses probabilistic sampling. The sample rate defines the probability a tracing span will be sampled and can have a value between 0 and 1 (inclusive). The default sample rate is 0.0001 (i.e. 1 in 10,000 spans is sampled).
In other words, if the app is calling to Dapr and wants to trace with an existing span (trace header), it must always propagate to Dapr (from service B to service C in this case). Dapr always propagates trace spans to an application.

To change the default tracing behavior, use a configuration file (in self hosted mode) or a Kubernetes configuration object (in Kubernetes mode). For example, the following configuration object changes the sample rate to 1 (i.e. every span is sampled), and sends trace using Zipkin protocol to the Zipkin server at http://zipkin.default.svc.cluster.local
{{% alert title="Note" color="primary" %}}
There are no helper methods exposed in Dapr SDKs to propagate and retrieve trace context. You need to use HTTP/gRPC clients to propagate and retrieve trace headers through HTTP headers and gRPC metadata.
{{% /alert %}}

```yaml
apiVersion: dapr.io/v1alpha1
kind: Configuration
metadata:
  name: tracing
  namespace: default
spec:
  tracing:
    samplingRate: "1"
    zipkin:
      endpointAddress: "http://zipkin.default.svc.cluster.local:9411/api/v2/spans"
```
3. Request is from external endpoint (for example, `from a gateway service to a Dapr-enabled service A`)

Note: Changing `samplingRate` to 0 disables tracing altogether.
An external gateway ingress calls Dapr, which generates the trace headers and calls service A. Service A then calls service B and further Dapr-enabled services. You must propagate the headers from service A to service B: `Ingress -> service A -> propagate trace headers -> service B`. This is similar to case 2 above.

See the [References](#references) section for more details on how to configure tracing on local environment and Kubernetes environment.
4. Pub/sub messages
Dapr generates the trace headers in the published message topic. These trace headers are propagated to any services listening on that topic.
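
For the sequential case (`service A -> service B -> service C`), the propagation step inside service B could look roughly like the sketch below. It copies the trace context headers from the incoming request onto the outgoing service invocation request. The app ID `serviceC`, the method name `orders`, the helper names, and the use of the sidecar's default HTTP port 3500 are assumptions made for illustration; the request is only built, not sent.

```python
from typing import Dict, Mapping
from urllib.request import Request

# W3C trace context headers that Dapr reads and writes on HTTP calls.
TRACE_HEADERS = ("traceparent", "tracestate")

# Assumption: the Dapr sidecar listens on its default HTTP port.
DAPR_HTTP_PORT = 3500


def extract_trace_headers(incoming: Mapping[str, str]) -> Dict[str, str]:
    """Return only the trace context headers from an incoming request.

    HTTP header names are case-insensitive, so names are lower-cased first.
    """
    lowered = {name.lower(): value for name, value in incoming.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}


def build_invoke_request(app_id: str, method: str, incoming: Mapping[str, str]) -> Request:
    """Build (but do not send) an invocation request that carries the trace context."""
    url = f"http://localhost:{DAPR_HTTP_PORT}/v1.0/invoke/{app_id}/method/{method}"
    headers = {"Content-Type": "application/json", **extract_trace_headers(incoming)}
    return Request(url, data=b"{}", headers=headers, method="POST")
```

In service B, you would call something like `build_invoke_request("serviceC", "orders", request_headers)` with the headers of the request received from service A, then send the result with the HTTP client of your choice.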

## References
### Propagating multiple different service calls
In the following scenarios, Dapr does some of the work for you and you need to either create or propagate trace headers.

- [How-To: Setup Application Insights for distributed tracing with OpenTelemetry Collector]({{< ref open-telemetry-collector.md >}})
- [How-To: Set up Zipkin for distributed tracing]({{< ref zipkin.md >}})
- [W3C distributed tracing]({{< ref w3c-tracing >}})
1. Multiple service calls to different services from single service

When you are calling multiple services from a single service (see example below), you need to propagate the trace headers:

```
service A -> service B
[ .. some code logic ..]
service A -> service C
[ .. some code logic ..]
service A -> service D
[ .. some code logic ..]
```

In this case, when service A first calls service B, Dapr generates the trace headers in service A, which are then propagated to service B. These trace headers are returned in the response from service B as part of response headers. You then need to propagate the returned trace context to the next services, service C and service D, as Dapr does not know you want to reuse the same header.
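
As a rough sketch of this fan-out case, the helper below captures the trace context returned by the first call and attaches it to every later call. The `Transport` callable and `fake_transport` are hypothetical stand-ins for a real HTTP call through the sidecar, used here so the flow can be followed without a running Dapr instance.

```python
from typing import Callable, Dict, List, Sequence, Tuple

TRACE_HEADERS = ("traceparent", "tracestate")

# Hypothetical transport signature: (app_id, request_headers) -> response_headers.
Transport = Callable[[str, Dict[str, str]], Dict[str, str]]


def call_services_with_shared_trace(
    app_ids: Sequence[str], transport: Transport
) -> List[Tuple[str, Dict[str, str]]]:
    """Call each app in order, reusing the trace context returned by the first call.

    Returns the (app_id, request_headers) pairs that were sent.
    """
    trace_context: Dict[str, str] = {}
    sent: List[Tuple[str, Dict[str, str]]] = []
    for app_id in app_ids:
        request_headers = dict(trace_context)
        response_headers = transport(app_id, request_headers)
        sent.append((app_id, request_headers))
        if not trace_context:
            # First response: remember the trace context for later calls.
            lowered = {k.lower(): v for k, v in response_headers.items()}
            trace_context = {h: lowered[h] for h in TRACE_HEADERS if h in lowered}
    return sent


def fake_transport(app_id: str, request_headers: Dict[str, str]) -> Dict[str, str]:
    """Illustrative stand-in: returns a new trace context when the request has none."""
    default = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
    return {"Traceparent": request_headers.get("traceparent", default)}
```

Calling `call_services_with_shared_trace(["serviceB", "serviceC", "serviceD"], fake_transport)` sends the first request without trace headers and the next two with the `traceparent` returned from service B.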

### Generating your own trace context headers from non-Daprized applications

You may have chosen to generate your own trace context headers.
Generating your own trace context headers is more unusual and typically not required when calling Dapr. However, there are scenarios where you could specifically choose to add W3C trace headers into a service call; for example, you have an existing application that does not use Dapr. In this case, Dapr still propagates the trace context headers for you. If you decide to generate trace headers yourself, there are three ways this can be done:

1. You can use the industry standard [OpenTelemetry SDKs](https://opentelemetry.io/docs/instrumentation/) to generate trace headers and pass these trace headers to a Dapr-enabled service. This is the preferred method.

2. You can use a vendor SDK that provides a way to generate W3C trace headers and pass them to a Dapr-enabled service.

3. You can handcraft a trace context following [W3C trace context specifications](https://www.w3.org/TR/trace-context/) and pass them to a Dapr-enabled service.
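
To illustrate the third option, here is a minimal sketch of handcrafting a version `00` `traceparent` value as described in the W3C specification: a 16-byte trace ID and an 8-byte parent ID, both lower-case hex, followed by the trace flags. The function name is hypothetical, and for real workloads the OpenTelemetry SDKs remain the preferred route.

```python
import secrets


def new_traceparent(sampled: bool = True) -> str:
    """Create a version-00 W3C traceparent value with random IDs."""
    trace_id = secrets.token_hex(16)   # 16 bytes -> 32 hex characters
    parent_id = secrets.token_hex(8)   # 8 bytes  -> 16 hex characters
    # All-zero IDs are invalid in the specification; regenerate in that
    # (extremely unlikely) case.
    while set(trace_id) == {"0"}:
        trace_id = secrets.token_hex(16)
    while set(parent_id) == {"0"}:
        parent_id = secrets.token_hex(8)
    flags = "01" if sampled else "00"  # bit 0 is the "sampled" flag
    return f"00-{trace_id}-{parent_id}-{flags}"
```

The returned string is then sent as the `traceparent` HTTP header on the call to the Dapr-enabled service.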

## W3C trace context

Dapr uses the standard W3C trace context headers.

- For HTTP requests, Dapr uses `traceparent` header.
- For gRPC requests, Dapr uses `grpc-trace-bin` header.

When a request arrives without a trace ID, Dapr creates a new one. Otherwise, it passes the trace ID along the call chain.

Read [trace context overview]({{< ref w3c-tracing-overview >}}) for more background on W3C trace context.

## W3C trace headers
These are the specific trace context headers that are generated and propagated by Dapr for HTTP and gRPC.

### Trace context HTTP headers format
When propagating a trace context header from an HTTP response to an HTTP request, you copy these headers.

#### Traceparent header
The traceparent header represents the incoming request in a tracing system in a common format, understood by all vendors.
Here’s an example of a traceparent header.

`traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01`

Find the traceparent fields detailed [here](https://www.w3.org/TR/trace-context/#traceparent-header).
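
As an illustration of those fields, the following sketch splits a version `00` `traceparent` value into its four parts. The `TraceParent` type and `parse_traceparent` function are hypothetical names; the sketch only accepts the exact version `00` layout and does not handle possible future versions.

```python
import re
from typing import NamedTuple

_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    trace_flags: str

    @property
    def sampled(self) -> bool:
        # Bit 0 of the flags byte marks the request as sampled.
        return bool(int(self.trace_flags, 16) & 0x01)


def parse_traceparent(value: str) -> TraceParent:
    """Parse a traceparent header value, raising ValueError when it is malformed."""
    match = _TRACEPARENT_RE.match(value.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {value!r}")
    parsed = TraceParent(*match.groups())
    if parsed.version == "ff":
        raise ValueError("version ff is not allowed")
    if set(parsed.trace_id) == {"0"} or set(parsed.parent_id) == {"0"}:
        raise ValueError("all-zero trace-id or parent-id is invalid")
    return parsed
```

For the example header above, this yields version `00`, trace ID `0af7651916cd43dd8448eb211c80319c`, parent ID `b7ad6b7169203331`, and flags `01` (sampled).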

#### Tracestate header
The tracestate header includes the parent in a potentially vendor-specific format:

`tracestate: congo=t61rcWkgMzE`

Find the tracestate fields detailed [here](https://www.w3.org/TR/trace-context/#tracestate-header).
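
The `tracestate` value is a comma-separated list of `key=value` members. A simplified sketch of reading it is shown below; `parse_tracestate` is a hypothetical helper that keeps member order and does not enforce the character-set or length limits from the specification.

```python
from typing import List, Tuple


def parse_tracestate(value: str) -> List[Tuple[str, str]]:
    """Split a tracestate header value into ordered (key, value) pairs."""
    entries: List[Tuple[str, str]] = []
    for member in value.split(","):
        member = member.strip()
        if not member:
            continue  # empty list members are skipped
        key, sep, val = member.partition("=")
        if not sep or not key or not val:
            raise ValueError(f"malformed tracestate member: {member!r}")
        entries.append((key, val))
    return entries
```

For the example header above, the result is a single pair with key `congo` and value `t61rcWkgMzE`.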

### Trace context gRPC headers format
In the gRPC API calls, trace context is passed through `grpc-trace-bin` header.

## Related Links

- [Observability concepts]({{< ref observability-concept.md >}})
- [W3C Trace Context for distributed tracing]({{< ref w3c-tracing-overview >}})
- [W3C Trace Context specification](https://www.w3.org/TR/trace-context/)
- [Observability quickstart](https://github.com/dapr/quickstarts/tree/master/tutorials/observability)