Provide an alternative OpenTelemetry implementation for traces that follows standard otel practices #43941

xBis7 · 2024-11-12T18:12:58Z

This PR is a draft because this is my first time contributing to Airflow and I'm not sure if there should be an AIP or a mailing discussion for such changes. I'd appreciate any feedback.

Issue description

related: #40802

There are some OpenTelemetry standard practices that help keep the usage consistent across multiple projects. According to those

when a root span starts, the trace id and the span id are generated randomly
while the span is active, the context is captured
- injected into a carrier object which is a map or a python dictionary
the carrier with the captured context is propagated across services and used to create sub spans
the new sub-span extracts the parent context from the carrier and uses it to set the parent
- the trace id is accessed from the carrier and the span id is generated randomly

To explain more, this is what the flow of operations should be

Dag run starts - start dagrun span
i. Get the dagrun span context
Start task - with the dagrun context, start ti span
i. Get the ti span context
With the ti span context create task-sub span
Task finishes - ti span ends
Dag run finishes - dagrun span ends

Airflow follows a different approach

trace and span ids are generated in a deterministic way from various properties of the dag_run and the task_instance
the spans for a dag and its tasks, are generated after the run has finished

This is the flow of the current implementation

Dag run starts
Tasks start
Tasks finish
Create spans for the tasks
Dag run finishes
Create span for the dag run

The current approach makes it impossible to create spans from under tasks while using the existing airflow code. To achieve that, you need to use https://github.com/howardyoo/airflow_otel_provider which has to generate the same trace id and span id as airflow otherwise the spans won't be properly associated. This isn't easily maintainable and it also makes it hard for people that are familiar with otel but new to airflow, to start using the feature.

These are some references to OpenTelemetry docs

https://opentelemetry.io/docs/concepts/context-propagation/
https://opentelemetry.io/docs/languages/python/propagation/
https://www.w3.org/TR/trace-context/

Implementation description

A lot of people might already be using airflow with the existing otel implementation. To avoid any inconvenience, the changes are hidden behind a config flag.

This patch is extending the existing implementation and not changing it. Once the flag is turned on, a new set of spans gets generated and exported. The new spans have the suffix _ctx_prop on their name.

I've reused the attributes from the original implementation. In addition, the timings are the same.

without context propagation
with context propagation

To be able to propagate the context of a span, the span must be active.

For example, for a dag run, the span can't be created at the end but

it needs to start with the run
stay active so that the context can be captured and propagated
end once the run also ends

Same goes for a task and any sub-spans.

With this approach, we can use the new otel methods for creating spans directly from under a task without the need of the airflow_otel_provider. These spans will be children of the task span.

Check test_otel_dag.py for an example of usage.

Testing

I've added a unit test for the otel_tracer methods and an integration test that runs a dag and then asserts the parent child relationships for the dag span, the task spans and the task sub spans.

I've also tested the changes manually with a PythonVirtualenvOperator without issues.

^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

boring-cyborg · 2024-11-12T18:13:03Z

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst)
Here are some useful points:

Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
Be sure to read the Airflow Coding style.
Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
Apache Airflow is a community-driven project and together we are making it better 🚀.
In case of doubts contact the developers at:
Mailing List: [email protected]
Slack: https://s.apache.org/airflow-slack

ashb · 2024-11-12T19:22:20Z

I'm confirming, but otel traces might have been experimental still, and if that's the case we are free to change them for a good reason, and your description certainly sounds like one!

ferruzzi · 2024-11-12T20:21:47Z

If I may, can you add some indication to the title that this is related to OTel Traces specifically? We also have OTel Metrics implemented and it would be nice to minimize confusion.

xBis7 · 2024-11-13T08:50:08Z

@ferruzzi I adjusted the title. The only change in this patch that is related to metrics, it's https://github.com/apache/airflow/pull/43941/files#diff-1cca954ec0be1aaf2c212e718c004cb0902a96ac60043bf0c97a782dee52cc32R85-R86

If you think that it's out of scope, then I can remove it.

ashb · 2024-11-13T12:56:14Z

airflow/executors/base_executor.py

@@ -330,6 +337,30 @@ def trigger_tasks(self, open_slots: int) -> None:
        for _ in range(min((open_slots, len(self.queued_tasks)))):
            key, (command, _, queue, ti) = sorted_queue.pop(0)

+            if self.otel_use_context_propagation:
+                # If it's None, then the span for the current TaskInstanceKey hasn't been started.
+                if self.active_spans.get(key) is None:


I've been chatting to @xBis7 about this, and we're not sure that this active_spans idea will work with Airflow's HA schedulers. We are exploring options.

ferruzzi · 2024-11-13T17:14:04Z

airflow/metrics/validators.py

+# According to otel spec, max length should be 255. Change if the spec gets revised.
+OTEL_NAME_MAX_LENGTH = 255


You can drop this for now. @ArshiaZr has a PR already going that is making this fix.

FWIW, long story short, that was there because when I initially implemented OTel Metrics there was a bug on their end that threw an exception if the name was longer than that, but it appears to have been fixed at some point.

xBis7 added 2 commits November 12, 2024 18:02

trace and span creation with context propagation

e1f721d

cleanup

fdb92df

boring-cyborg bot added area:CLI area:Executors-core LocalExecutor & SequentialExecutor area:Scheduler including HA (high availability) scheduler area:serialization kind:documentation labels Nov 12, 2024

xBis7 changed the title ~~Provide an alternative OpenTelemetry implementation that follows standard otel practices~~ Provide an alternative OpenTelemetry implementation for traces that follows standard otel practices Nov 13, 2024

ashb reviewed Nov 13, 2024

View reviewed changes

ferruzzi reviewed Nov 13, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Provide an alternative OpenTelemetry implementation for traces that follows standard otel practices #43941

Provide an alternative OpenTelemetry implementation for traces that follows standard otel practices #43941

xBis7 commented Nov 12, 2024 •

edited

Loading

boring-cyborg bot commented Nov 12, 2024

ashb commented Nov 12, 2024

ferruzzi commented Nov 12, 2024 •

edited

Loading

xBis7 commented Nov 13, 2024

ashb Nov 13, 2024

ferruzzi Nov 13, 2024

		# According to otel spec, max length should be 255. Change if the spec gets revised.
		OTEL_NAME_MAX_LENGTH = 255

Provide an alternative OpenTelemetry implementation for traces that follows standard otel practices #43941

Are you sure you want to change the base?

Provide an alternative OpenTelemetry implementation for traces that follows standard otel practices #43941

Conversation

xBis7 commented Nov 12, 2024 • edited Loading

Issue description

Implementation description

Testing

boring-cyborg bot commented Nov 12, 2024

ashb commented Nov 12, 2024

ferruzzi commented Nov 12, 2024 • edited Loading

xBis7 commented Nov 13, 2024

ashb Nov 13, 2024

Choose a reason for hiding this comment

ferruzzi Nov 13, 2024

Choose a reason for hiding this comment

xBis7 commented Nov 12, 2024 •

edited

Loading

ferruzzi commented Nov 12, 2024 •

edited

Loading