feat(ingestion/sql-common): add column level lineage for external tables #11997

acrylJonny · 2024-12-02T10:29:48Z

Checklist

The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
Links to related issues (if applicable)
Tests for the changes have been added/updated (if applicable)
Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

sgomezvillamor · 2024-12-17T12:20:35Z

metadata-ingestion/src/datahub/ingestion/source/sql/sql_common.py

+        if self.ctx.graph:
+            upstream_schema_metadata: Optional[
+                SchemaMetadata
+            ] = self.ctx.graph.get_schema_metadata(upstream_dataset_urn)


I'm not sure whether fetching the upstream schema from the DataHub graph is an existing pattern in the connectors, so I'm asking:

What if the upstream schema isn't published yet? Should we report a warning?

Isn't the connector itself producing the upstream schema? If so, can we skip fetching it from the DataHub graph?

Could we overload the DataHub graph with too many requests if we call it hundreds or thousands of times?

Let me know your thoughts when you get a chance!

Agree that we should avoid making graph calls like this if we can.

@acrylJonny what was the reason we need this lookup?

Essentially, it was to ensure that the upstream dataset has the same column as the downstream. We could skip this validation, though, and simply link the upstream to the downstream blindly. That would remove the need to fetch the upstream schema.

Yep let's just do the blind linking for now - it's what we do for most other sources that generate external lineage

The main risk there is actually column casing mismatches, but for those we'd need to use a fuzzy matcher - we recently added this to looker, but imo it's probably not necessary here
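To make the two options in this thread concrete, here is a minimal, self-contained sketch. It is not the code from this PR and does not use the DataHub SDK: the helper names (`make_schema_field_urn`, `blind_column_lineage`, `match_ignoring_case`) are hypothetical, and URNs are built as plain strings on the assumption that schema field URNs take the form `urn:li:schemaField:(<dataset urn>,<field path>)`. The case-insensitive matcher is a simple stand-in for a fuzzy matcher, not the one added to looker.

```python
from typing import Dict, List, Tuple


def make_schema_field_urn(dataset_urn: str, field_path: str) -> str:
    """Build a schemaField URN string for one column (hypothetical helper)."""
    return f"urn:li:schemaField:({dataset_urn},{field_path})"


def blind_column_lineage(
    upstream_urn: str, downstream_urn: str, downstream_columns: List[str]
) -> List[Tuple[str, str]]:
    """Link each downstream column to a same-named upstream column.

    No upstream schema is fetched, so nothing verifies that the
    upstream column exists or that its casing matches.
    """
    return [
        (
            make_schema_field_urn(upstream_urn, column),
            make_schema_field_urn(downstream_urn, column),
        )
        for column in downstream_columns
    ]


def match_ignoring_case(
    upstream_columns: List[str], downstream_columns: List[str]
) -> Dict[str, str]:
    """Map downstream column -> upstream column, comparing names case-insensitively.

    Unlike blind linking, this needs the upstream column names.
    """
    by_lower = {name.lower(): name for name in upstream_columns}
    return {
        column: by_lower[column.lower()]
        for column in downstream_columns
        if column.lower() in by_lower
    }
```

Blind linking only needs the downstream column names, which is why it avoids the graph lookup. The case-insensitive variant shows what would be required to handle casing mismatches: some source of upstream column names to compare against.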

hsheth2 · 2024-12-17T17:29:34Z

The SqlParsingAggregator has an add_known_lineage_mapping method that generates column-level lineage (CLL) based on the schema of the downstream. Ideally we'd centralize on using that as the external lineage mechanism.

Long term, I want to move sql_common.py to use the SqlParsingAggregator instead of the older SqlParsingBuilder. Internal ticket tracking that - https://linear.app/acryl-data/issue/ING-779/refactor-move-sql-common-to-use-sqlparsingaggregator

In the short term, I'm ok with having this CLL generation logic, although all the complexity of the simplify_field_path logic worries me a bit on this PR.

hsheth2 · 2024-12-30T19:14:14Z

Now that #12220 has been merged, we can make this implementation a bit cleaner.

cll dev

2003e71

acrylJonny marked this pull request as draft December 2, 2024 10:29

github-actions bot added the ingestion and community-contribution labels Dec 2, 2024

acrylJonny and others added 3 commits December 2, 2024 10:33

Merge branch 'master' into athena-cll

a5539fd

Update sql_common.py

e28004b

Update sql_common.py

34c60f3

acrylJonny added 2 commits December 2, 2024 11:24

Update sql_common.py

de85aff

Update sql_common.py

5078a3f

acrylJonny added 2 commits December 2, 2024 12:37

Update sql_common.py

d1360c9

Update sql_common.py

fb1409d

Update sql_common.py

e2f6b26

acrylJonny changed the title ~~feat(ingestion/sql-common): column level lineage for external tables~~ feat(ingestion/sql-common): add column level lineage for external tables Dec 2, 2024

acrylJonny marked this pull request as ready for review December 2, 2024 13:36

datahub-cyborg bot added the needs-review label Dec 2, 2024

acrylJonny added 3 commits December 2, 2024 14:47

Update sql_common.py

0f25c53

Update sql_common.py

8839ce0

Update sql_common.py

847cce9

Update sql_common.py

5ef2404

Update sql_common.py

b6cb0ec

Merge branch 'master' into athena-cll

1516fde

anshbansal removed the community-contribution label Dec 4, 2024

sgomezvillamor reviewed Dec 17, 2024

datahub-cyborg bot added the pending-submitter-response label and removed the needs-review label Dec 17, 2024

acrylJonny and others added 3 commits December 18, 2024 09:06

Merge branch 'master' into athena-cll

1c8f49d

Update sql_common.py

c210a61

Merge branch 'athena-cll' of https://github.com/acrylJonny/datahub into athena-cll

ce6a8ac

datahub-cyborg bot added the needs-review label and removed the pending-submitter-response label Dec 18, 2024

Merge branch 'master' into athena-cll

b4b921d

datahub-cyborg bot added the pending-submitter-response label and removed the needs-review label Dec 30, 2024
