Dataset Aliases #1119

Open
Tracked by #1192
pankajkoti opened this issue Jul 24, 2024 · 1 comment · May be fixed by #1217

pankajkoti commented Jul 24, 2024

Description co-authored by @tatiana and @pankajastro

Since Cosmos 1.1, Cosmos creates Airflow inlets and outlets for every dbt model/seed/snapshot task, allowing end-users to leverage Airflow data-aware scheduling.

In the past, Cosmos identified these inlets and outlets using URIs that were not representative of the dataset being created. The one advantage of this approach was that the identifiers could be created during DAG parsing/processing time.

This changed in the 1.1 release, when we decided to adopt the OpenLineage naming convention to describe the Airflow Datasets created by Cosmos (inlets/outlets). They became URIs similar to "postgres://0.0.0.0:5432/postgres.public.stg_customers". The downside of this approach is that we started using the openlineage-integration-common library, which can only create the resource URIs after the dbt command has run, since it currently relies on dbt-core artefacts. This means we started creating inlets/outlets during task execution.
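For reference, this is roughly how an end-user consumes one of these Cosmos-emitted datasets via data-aware scheduling (a minimal sketch: the DAG id and task are illustrative, and the dataset URI is the example shown above):

```python
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator

# Dataset URI following the OpenLineage naming convention used by Cosmos >= 1.1
stg_customers = Dataset("postgres://0.0.0.0:5432/postgres.public.stg_customers")

with DAG(
    dag_id="downstream_of_stg_customers",  # illustrative DAG id
    schedule=[stg_customers],  # run whenever a dataset event is emitted for this URI
    start_date=datetime(2024, 1, 1),
    catchup=False,
):
    EmptyOperator(task_id="run_after_stg_customers")
```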

A side effect of this change is that Airflow <= 2.9 was not designed to support setting inlets and outlets during task execution, which resulted in the long-standing issue #522.

Another side effect was that, since we started relying on task execution to determine the Airflow dataset identifier, we did not expose a method for end-users to easily determine it themselves. More context in #1036.

The community raises this limitation very often.

We created an issue in Airflow:
apache/airflow#34206

After several discussions, @uranusjr proposed introducing the concept of DatasetAliases in Airflow 2.10, and @Lee-W worked on the implementation: apache/airflow#40478

This feature will be released as part of Airflow 2.10.
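To illustrate the concept, here is a minimal sketch based on the Airflow 2.10 DatasetAlias documentation (not Cosmos-specific code; the alias name, DAG ids, and dataset URI are made up): the alias is declared at DAG-parsing time, a concrete dataset is attached to it at runtime, and a downstream DAG can be scheduled on the alias.

```python
from datetime import datetime

from airflow.datasets import Dataset, DatasetAlias
from airflow.decorators import dag, task


@dag(start_date=datetime(2024, 1, 1), schedule=None, catchup=False)
def producer():
    # The alias is known at DAG-parsing time; the concrete dataset is only
    # attached to it while the task runs.
    @task(outlets=[DatasetAlias("cosmos_stg_customers")])
    def emit(*, outlet_events):
        outlet_events[DatasetAlias("cosmos_stg_customers")].add(
            Dataset("postgres://0.0.0.0:5432/postgres.public.stg_customers")
        )

    emit()


@dag(
    start_date=datetime(2024, 1, 1),
    schedule=[DatasetAlias("cosmos_stg_customers")],  # schedule on the alias, not the URI
    catchup=False,
)
def consumer():
    @task
    def react():
        print("Triggered by a dataset event attached to the alias")

    react()


producer()
consumer()
```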

The goal of this epic is to leverage Airflow DatasetAliases in Cosmos, so that:

  • Users can clearly see the datasets created by Cosmos during task execution in the Airflow UI.
  • We can have non-OpenLineage Dataset Aliases that are added during DAG parsing time, and we can expose methods for users to retrieve them.

Initially planned tasks, more to be added as part of the PoC ticket:


tatiana commented Sep 24, 2024

I made significant progress on this task, as can be seen in PR #1217.

Yesterday, I implemented the changes to the code itself (no tests, just a quick PoC).
Today, I validated it and made a minor adjustment to make it work.

The change works as expected with the Astro CLI, but not so well with Airflow standalone. I connected with Wei about this, and he'll investigate further.

I was able to see the Datasets / Dataset Aliases in the Airflow UI.

I was also able to see a DAG being triggered. I'll soon share more information on this.
