Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide an alternative OpenTelemetry implementation for traces that follows standard otel practices #43941

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

xBis7
Copy link

@xBis7 xBis7 commented Nov 12, 2024

This PR is a draft because this is my first time contributing to Airflow and I'm not sure if there should be an AIP or a mailing discussion for such changes. I'd appreciate any feedback.

Issue description

related: #40802

There are some OpenTelemetry standard practices that help keep the usage consistent across multiple projects. According to those

  • when a root span starts, the trace id and the span id are generated randomly
  • while the span is active, the context is captured
    • injected into a carrier object which is a map or a python dictionary
  • the carrier with the captured context is propagated across services and used to create sub spans
  • the new sub-span extracts the parent context from the carrier and uses it to set the parent
    • the trace id is accessed from the carrier and the span id is generated randomly

To explain more, this is what the flow of operations should be

  1. Dag run starts - start dagrun span
    i. Get the dagrun span context
  2. Start task - with the dagrun context, start ti span
    i. Get the ti span context
  3. With the ti span context create task-sub span
  4. Task finishes - ti span ends
  5. Dag run finishes - dagrun span ends

Airflow follows a different approach

  • trace and span ids are generated in a deterministic way from various properties of the dag_run and the task_instance
  • the spans for a dag and its tasks, are generated after the run has finished

This is the flow of the current implementation

  1. Dag run starts
  2. Tasks start
  3. Tasks finish
  4. Create spans for the tasks
  5. Dag run finishes
  6. Create span for the dag run

The current approach makes it impossible to create spans from under tasks while using the existing airflow code. To achieve that, you need to use https://github.com/howardyoo/airflow_otel_provider which has to generate the same trace id and span id as airflow otherwise the spans won't be properly associated. This isn't easily maintainable and it also makes it hard for people that are familiar with otel but new to airflow, to start using the feature.

These are some references to OpenTelemetry docs

https://opentelemetry.io/docs/concepts/context-propagation/
https://opentelemetry.io/docs/languages/python/propagation/
https://www.w3.org/TR/trace-context/

Implementation description

A lot of people might already be using airflow with the existing otel implementation. To avoid any inconvenience, the changes are hidden behind a config flag.

This patch is extending the existing implementation and not changing it. Once the flag is turned on, a new set of spans gets generated and exported. The new spans have the suffix _ctx_prop on their name.

image

I've reused the attributes from the original implementation. In addition, the timings are the same.

  • without context propagation
    image
  • with context propagation
    image

To be able to propagate the context of a span, the span must be active.

For example, for a dag run, the span can't be created at the end but

  • it needs to start with the run
  • stay active so that the context can be captured and propagated
  • end once the run also ends

Same goes for a task and any sub-spans.

With this approach, we can use the new otel methods for creating spans directly from under a task without the need of the airflow_otel_provider. These spans will be children of the task span.

Check test_otel_dag.py for an example of usage.

Testing

I've added a unit test for the otel_tracer methods and an integration test that runs a dag and then asserts the parent child relationships for the dag span, the task spans and the task sub spans.

I've also tested the changes manually with a PythonVirtualenvOperator without issues.


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg boring-cyborg bot added area:CLI area:Executors-core LocalExecutor & SequentialExecutor area:Scheduler including HA (high availability) scheduler area:serialization kind:documentation labels Nov 12, 2024
Copy link

boring-cyborg bot commented Nov 12, 2024

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst)
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: [email protected]
    Slack: https://s.apache.org/airflow-slack

@ashb
Copy link
Member

ashb commented Nov 12, 2024

I'm confirming, but otel traces might have been experimental still, and if that's the case we are free to change them for a good reason, and your description certainly sounds like one!

@ferruzzi
Copy link
Contributor

ferruzzi commented Nov 12, 2024

If I may, can you add some indication to the title that this is related to OTel Traces specifically? We also have OTel Metrics implemented and it would be nice to minimize confusion.

@xBis7 xBis7 changed the title Provide an alternative OpenTelemetry implementation that follows standard otel practices Provide an alternative OpenTelemetry implementation for traces that follows standard otel practices Nov 13, 2024
@xBis7
Copy link
Author

xBis7 commented Nov 13, 2024

@ferruzzi I adjusted the title. The only change in this patch that is related to metrics, it's https://github.com/apache/airflow/pull/43941/files#diff-1cca954ec0be1aaf2c212e718c004cb0902a96ac60043bf0c97a782dee52cc32R85-R86

If you think that it's out of scope, then I can remove it.

@@ -330,6 +337,30 @@ def trigger_tasks(self, open_slots: int) -> None:
for _ in range(min((open_slots, len(self.queued_tasks)))):
key, (command, _, queue, ti) = sorted_queue.pop(0)

if self.otel_use_context_propagation:
# If it's None, then the span for the current TaskInstanceKey hasn't been started.
if self.active_spans.get(key) is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been chatting to @xBis7 about this, and we're not sure that this active_spans idea will work with Airflow's HA schedulers. We are exploring options.

Comment on lines +85 to +86
# According to otel spec, max length should be 255. Change if the spec gets revised.
OTEL_NAME_MAX_LENGTH = 255
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can drop this for now. @ArshiaZr has a PR already going that is making this fix.

FWIW, long story short, that was there because when I initially implemented OTel Metrics there was a bug on their end that threw an exception if the name was longer than that, but it appears to have been fixed at some point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:CLI area:Executors-core LocalExecutor & SequentialExecutor area:Scheduler including HA (high availability) scheduler area:serialization kind:documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants