feat(ingestion/sql-common): add column level lineage for external tables #11997

Open
wants to merge 19 commits into
base: master

Conversation

acrylJonny
Contributor

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@acrylJonny acrylJonny marked this pull request as draft December 2, 2024 10:29
@github-actions github-actions bot added ingestion PR or Issue related to the ingestion of metadata community-contribution PR or Issue raised by member(s) of DataHub Community labels Dec 2, 2024
@acrylJonny acrylJonny changed the title feat(ingestion/sql-common): column level lineage for external tables feat(ingestion/sql-common): add column level lineage for external tables Dec 2, 2024
@acrylJonny acrylJonny marked this pull request as ready for review December 2, 2024 13:36
@datahub-cyborg datahub-cyborg bot added the needs-review Label for PRs that need review from a maintainer. label Dec 2, 2024
@anshbansal anshbansal removed the community-contribution PR or Issue raised by member(s) of DataHub Community label Dec 4, 2024
if self.ctx.graph:
    upstream_schema_metadata: Optional[
        SchemaMetadata
    ] = self.ctx.graph.get_schema_metadata(upstream_dataset_urn)
Contributor

Not sure if fetching the upstream schema from the DataHub graph is an existing pattern in the connectors, so I'm asking:

  • What if the upstream schema isn't published yet? Should we report a warning?
  • Isn't the connector itself producing the upstream schema? If so, can we skip fetching it from the DataHub graph?
  • Could we overload the DataHub graph with too many requests if we call it hundreds or thousands of times?

Let me know your thoughts when you get a chance!
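
For illustration only, the first and third concerns could be handled roughly as below. This is a minimal sketch, not the PR's code: `CachedSchemaLookup` and its `fetch` callable are hypothetical names, with `fetch` standing in for whatever call actually retrieves a schema from the graph.

```python
from typing import Callable, Dict, List, Optional


class CachedSchemaLookup:
    """Memoize schema lookups so each upstream URN is fetched at most once.

    Hypothetical helper: `fetch` is an assumed stand-in for a graph call
    that returns the column names of a dataset, or None if no schema exists.
    """

    def __init__(self, fetch: Callable[[str], Optional[List[str]]]) -> None:
        self._fetch = fetch
        self._cache: Dict[str, Optional[List[str]]] = {}
        self.warnings: List[str] = []

    def get_columns(self, urn: str) -> Optional[List[str]]:
        if urn not in self._cache:
            columns = self._fetch(urn)
            if columns is None:
                # Record the missing upstream once instead of failing silently.
                self.warnings.append(f"Upstream schema not found for {urn}")
            # Cache misses as well, so an unpublished schema is not re-requested.
            self._cache[urn] = columns
        return self._cache[urn]
```

Repeated lookups for the same URN then cost one request, and a missing schema produces a single warning rather than one per column.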

Collaborator

Agree that we should avoid making graph calls like this if we can.

@acrylJonny what was the reason we need this lookup?

Contributor Author

@acrylJonny acrylJonny Dec 17, 2024

Essentially it was to ensure that the upstream dataset has the same column as the downstream. We could choose to skip this validation though and simply link the upstream with the downstream blindly. This would remove the need to obtain the schema of the upstream.

Collaborator

Yep let's just do the blind linking for now - it's what we do for most other sources that generate external lineage

The main risk there is actually column casing mismatches. For those we'd need to use a fuzzy matcher; we recently added one to Looker, but IMO it's probably not necessary here.
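
A rough sketch of what "blind linking" means here, plus the simplest form of case-insensitive matching. This is illustrative only and does not use the DataHub SDK: the function names are hypothetical, and the field URN layout is assumed to be `urn:li:schemaField:(<dataset urn>,<field path>)`.

```python
from typing import List, Optional, Tuple


def make_field_urn(dataset_urn: str, column: str) -> str:
    # Assumed schemaField URN layout; hypothetical helper, not the SDK's.
    return f"urn:li:schemaField:({dataset_urn},{column})"


def blind_column_lineage(
    upstream_urn: str, downstream_urn: str, downstream_columns: List[str]
) -> List[Tuple[str, str]]:
    """Pair each downstream column with a same-named upstream column.

    No upstream schema is consulted, so no graph lookup is needed.
    """
    return [
        (make_field_urn(upstream_urn, col), make_field_urn(downstream_urn, col))
        for col in downstream_columns
    ]


def match_ignoring_case(column: str, upstream_columns: List[str]) -> Optional[str]:
    """Return the upstream column equal to `column` up to casing, if any.

    Unlike blind linking, this needs the upstream column names.
    """
    by_lower = {c.lower(): c for c in upstream_columns}
    return by_lower.get(column.lower())
```

The trade-off is visible in the signatures: blind linking needs only the downstream columns, while any casing-aware match needs the upstream schema as well, which is what the graph lookup was providing.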

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Dec 17, 2024
@hsheth2
Collaborator

hsheth2 commented Dec 17, 2024

The SqlParsingAggregator has an add_known_lineage_mapping method that generates CLL based on the schema of the downstream. Ideally we'd centralize on using that as the external lineage mechanism.

Long term, I want to move sql_common.py to use the SqlParsingAggregator instead of the older SqlParsingBuilder. Internal ticket tracking that - https://linear.app/acryl-data/issue/ING-779/refactor-move-sql-common-to-use-sqlparsingaggregator

In the short term, I'm ok with having this CLL generation logic, although all the complexity of the simplify_field_path logic worries me a bit on this PR.

@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Dec 18, 2024
@hsheth2
Collaborator

hsheth2 commented Dec 30, 2024

Now that #12220 has been merged, we can make this implementation a bit cleaner.

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Dec 30, 2024