feat(ingestion/sql-common): add column level lineage for external tables #11997
base: master
Conversation
```python
if self.ctx.graph:
    upstream_schema_metadata: Optional[
        SchemaMetadata
    ] = self.ctx.graph.get_schema_metadata(upstream_dataset_urn)
```
Not sure if fetching the upstream schema from the DataHub graph is an existing pattern in the connectors, so I'm asking:
- What if the upstream schema isn't published yet? Should we report a warning?
- Isn't the connector itself producing the upstream schema? If so, could we skip fetching it from the DataHub graph?
- Might we be overloading the DataHub graph with too many requests if we call it hundreds or thousands of times?
Let me know your thoughts when you get a chance!
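To illustrate the first point, here's a minimal sketch of guarding the lookup and surfacing a warning when the upstream schema hasn't been published yet. The `report_warning` call mirrors the usual ingestion-source reporting pattern, but the exact helper names here are assumptions, not the PR's actual code:

```python
# Hypothetical sketch: warn instead of silently dropping CLL when the
# upstream schema is missing from the DataHub graph.
from typing import Any, Optional


def fetch_upstream_schema(ctx: Any, report: Any, upstream_dataset_urn: str) -> Optional[Any]:
    if not ctx.graph:
        return None
    schema = ctx.graph.get_schema_metadata(upstream_dataset_urn)
    if schema is None:
        # The upstream may simply not be ingested yet; make that visible.
        report.report_warning(
            "external-lineage",
            f"No schema found for upstream {upstream_dataset_urn}; "
            "skipping column-level lineage.",
        )
    return schema
```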
Agree that we should avoid making graph calls like this if we can avoid it.
@acrylJonny what was the reason we need this lookup?
Essentially it was to ensure that the upstream dataset has the same columns as the downstream. We could choose to skip this validation though and simply link the upstream with the downstream blindly. This would remove the need to obtain the schema of the upstream.
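For concreteness, "blind" linking could look roughly like this: emit a field-level edge for every downstream column, pointing at the identically named field on the upstream URN, with no graph lookup at all. This is a sketch under that assumption; the URN formatting helper is simplified, not the real `make_schema_field_urn` from the codebase:

```python
# Hypothetical sketch of blind column-level lineage: assume the upstream
# has the same field paths as the downstream and pair them up by name.
from typing import List, Tuple


def blind_column_lineage(
    upstream_urn: str, downstream_urn: str, downstream_columns: List[str]
) -> List[Tuple[str, str]]:
    # Each pair is (upstream field URN, downstream field URN); the field
    # path is taken verbatim from the downstream column name.
    return [
        (
            f"urn:li:schemaField:({upstream_urn},{col})",
            f"urn:li:schemaField:({downstream_urn},{col})",
        )
        for col in downstream_columns
    ]
```

The trade-off is exactly the one discussed below: if the upstream column doesn't actually exist (or differs in casing), the edge dangles.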
Yep let's just do the blind linking for now - it's what we do for most other sources that generate external lineage
The main risk there is actually column casing mismatches, but for those we'd need to use a fuzzy matcher - we recently added this to looker, but imo it's probably not necessary here
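The simplest version of such a matcher is just case-insensitive name matching. This sketch is an approximation of the idea; the actual fuzzy matcher added to the Looker source may behave differently:

```python
# Sketch: map each downstream column to an upstream column ignoring case,
# returning None when there is no candidate at all.
from typing import Dict, List, Optional


def match_columns_ignoring_case(
    downstream_cols: List[str], upstream_cols: List[str]
) -> Dict[str, Optional[str]]:
    # Index upstream columns by their lowercased name; later duplicates
    # (e.g. "ID" and "id" both present) overwrite earlier ones.
    by_lower = {c.lower(): c for c in upstream_cols}
    return {col: by_lower.get(col.lower()) for col in downstream_cols}
```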
Long term, I want to move sql_common.py to use the SqlParsingAggregator instead of the older SqlParsingBuilder. Internal ticket tracking that: https://linear.app/acryl-data/issue/ING-779/refactor-move-sql-common-to-use-sqlparsingaggregator. In the short term, I'm ok with having this CLL generation logic, although all the complexity of the …
Now that #12220 has been merged, we can make this implementation a bit cleaner.
Checklist