feat(ingestion/iceberg): Several improvements to iceberg connector #12744

Merged: skrydal merged 12 commits into datahub-project:master from ps_improve_iceberg_connector on Mar 1, 2025

skrydal (Collaborator) commented Feb 27, 2025

  • Extend datasetProperties information with creation and last modified times (if available)
  • Handle FileIO errors (class not found) as warnings. They are detected by inspecting the message of the ValueError exception; in the long term we should probably change the pyiceberg library to throw a unique exception type in this case
  • The timing-reporting classes were not thread-safe; introduced locks
  • Extend datasetProperties with qualifiedName (set as namespace.table)
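
As a rough illustration of the first and last bullets, the sketch below assembles the extra properties in plain Python. The helper name `build_extra_properties`, its parameters, and the dictionary keys are hypothetical stand-ins rather than the connector's actual code; only the `namespace.table` form of the qualified name and the "if available" handling of the timestamps come from the description above.

```python
from typing import Dict, Optional, Tuple, Union


def build_extra_properties(
    dataset_path: Tuple[str, ...],
    created_ms: Optional[int] = None,
    last_modified_ms: Optional[int] = None,
) -> Dict[str, Union[str, int]]:
    """Hypothetical helper: collect the optional properties for one table."""
    props: Dict[str, Union[str, int]] = {
        # Namespace parts and table name joined with dots: namespace.table
        "qualifiedName": ".".join(dataset_path),
    }
    # Timestamps are only set when a value is actually available.
    if created_ms is not None:
        props["created"] = created_ms
    if last_modified_ms is not None:
        props["lastModified"] = last_modified_ms
    return props


print(build_extra_properties(("analytics", "events"), created_ms=1740614400000))
```

Leaving the timestamp keys out entirely, instead of writing a placeholder such as 0, keeps "unknown" distinguishable from a real value.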

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Feb 27, 2025

codecov bot commented Feb 27, 2025

Codecov Report

Attention: Patch coverage is 93.02326% with 3 lines in your changes missing coverage. Please review.

Files with missing lines                                 Patch %   Lines
...on/src/datahub/ingestion/source/iceberg/iceberg.py    91.66%    2 Missing ⚠️
...datahub/ingestion/source/iceberg/iceberg_common.py    94.73%    1 Missing ⚠️

@datahub-cyborg datahub-cyborg bot added the needs-review Label for PRs that need review from a maintainer. label Feb 27, 2025
    )
    return

def _try_processing_dataset(
    dataset_path: Tuple[str, ...], dataset_name: str
Contributor

Tuple[str, ...] seems equivalent to List[str]

I guess the only difference is that the tuple ensures a size >= 1, and that's relevant here, right?

Collaborator Author

There are more differences between a tuple and a list, and this annotation reflects them. For one, a tuple is immutable, so the callee cannot modify the caller's data. Moreover, this annotation matches what we get from pyiceberg, so overall I would like to keep it as it is.
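
For reference, a small sketch of how `Tuple[str, ...]` behaves next to `List[str]`. The function name `describe` and the sample values are made up for illustration. Two points worth noting: the `...` form annotates a tuple of any length, including an empty one, so it does not by itself guarantee a size >= 1; and Python passes tuples by object reference like any other object, so it is immutability (not copying) that protects the caller's data.

```python
from typing import List, Tuple


def describe(dataset_path: Tuple[str, ...]) -> str:
    """Join a table identifier; works for a tuple of any length."""
    return ".".join(dataset_path)


empty: Tuple[str, ...] = ()  # valid: "..." means zero or more items
path: Tuple[str, ...] = ("db", "schema", "table")
as_list: List[str] = list(path)

# Tuples are hashable, so they can be set members or dict keys; lists cannot.
seen = {path}

# Tuples reject item assignment at runtime; lists allow it.
try:
    path[0] = "other"  # type: ignore[index]
    mutated = True
except TypeError:
    mutated = False

as_list[0] = "other"
print(describe(path), mutated, as_list)
```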

except ValueError as e:
    if "Could not initialize FileIO" not in str(e):
        raise
    self.report.warning(
Contributor

Did you perhaps forget the (redundant) LOGGER.warning for this except case?

Collaborator Author

The warning method logs automatically, unlike report_warning. I think I should change all call sites to use warning; see:

def report_warning(
    self,
    message: LiteralString,
    context: Optional[str] = None,
    title: Optional[LiteralString] = None,
    exc: Optional[BaseException] = None,
) -> None:
    self._structured_logs.report_log(
        StructuredLogLevel.WARN, message, title, context, exc, log=False
    )

def warning(
    self,
    message: LiteralString,
    context: Optional[str] = None,
    title: Optional[LiteralString] = None,
    exc: Optional[BaseException] = None,
) -> None:
    self._structured_logs.report_log(
        StructuredLogLevel.WARN, message, title, context, exc, log=True
    )
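
To make the pattern discussed in this thread concrete, here is a self-contained sketch. `SketchReport`, `load_table`, and `try_processing` are hypothetical stand-ins written for this example, not DataHub or pyiceberg code; the only parts taken from the PR are the message check on the ValueError and the difference between a reporting method that also logs (`warning`) and one that only records (`report_warning`).

```python
import logging
from typing import List, Optional

logger = logging.getLogger(__name__)


class SketchReport:
    """Hypothetical stand-in for the source report, not DataHub's class."""

    def __init__(self) -> None:
        self.warnings: List[str] = []

    def _record(self, message: str, context: Optional[str], log: bool) -> None:
        self.warnings.append(message)
        if log:
            logger.warning("%s (%s)", message, context)

    def report_warning(self, message: str, context: Optional[str] = None) -> None:
        self._record(message, context, log=False)  # recorded only

    def warning(self, message: str, context: Optional[str] = None) -> None:
        self._record(message, context, log=True)  # recorded and logged


def load_table(name: str) -> str:
    """Hypothetical loader that fails in the two ways we want to tell apart."""
    if name == "bad_io":
        raise ValueError("Could not initialize FileIO: some.missing.Class")
    if name == "bad_value":
        raise ValueError("unrelated problem")
    return name


def try_processing(report: SketchReport, name: str) -> Optional[str]:
    try:
        return load_table(name)
    except ValueError as e:
        # Only the FileIO initialization failure is downgraded to a warning;
        # every other ValueError still propagates.
        if "Could not initialize FileIO" not in str(e):
            raise
        report.warning("Could not initialize FileIO", context=name)
        return None
```

Matching on the message text is fragile, which is why the PR description suggests a dedicated exception type in pyiceberg as the longer-term fix.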

Contributor

@sgomezvillamor sgomezvillamor left a comment

LGTM

Adding locks feels like repetitive code.
Not a big fan of decorators, but they may be helpful here.

@datahub-cyborg datahub-cyborg bot added pending-submitter-merge and removed needs-review Label for PRs that need review from a maintainer. labels Feb 28, 2025
Collaborator Author

skrydal commented Feb 28, 2025

> LGTM
>
> Adding locks feels like repetitive code. Not a big fan of decorators, but they may be helpful here.

Considering they are single-line with lock statements... I think decorators would take the same number of lines (besides the lock initialization), right? Do you have any particular decorator in mind that I could use?

@sgomezvillamor
Contributor

> Do you have any particular decorator in mind that I could use?

Unexpectedly, I haven't found any! 😅
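
For readers curious what such a decorator could look like, here is a minimal sketch. Both `synchronized` and `TimingStats` are hypothetical names written for this example; they are not part of the connector, and the standard library does not, as far as I know, ship a ready-made equivalent. The decorator simply wraps the method body in `with self._lock`, which is the same thing the inline form does.

```python
import threading
from functools import wraps
from typing import Any, Callable, Dict


def synchronized(method: Callable[..., Any]) -> Callable[..., Any]:
    """Hypothetical decorator: run the method while holding self._lock."""

    @wraps(method)
    def wrapper(self: Any, *args: Any, **kwargs: Any) -> Any:
        with self._lock:
            return method(self, *args, **kwargs)

    return wrapper


class TimingStats:
    """Illustrative timing collector, not the connector's actual class."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.totals: Dict[str, float] = {}
        self.counts: Dict[str, int] = {}

    @synchronized
    def add(self, name: str, seconds: float) -> None:
        # Read-modify-write on shared dicts; unsafe without the lock.
        self.totals[name] = self.totals.get(name, 0.0) + seconds
        self.counts[name] = self.counts.get(name, 0) + 1


def worker(stats: TimingStats) -> None:
    for _ in range(1000):
        stats.add("load_table", 0.5)


stats = TimingStats()
threads = [threading.Thread(target=worker, args=(stats,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(stats.counts["load_table"], stats.totals["load_table"])
```

As noted in the thread, the line count ends up about the same: one `@synchronized` line per method replaces one `with self._lock:` line, and the lock still has to be created in `__init__`.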

@skrydal skrydal merged commit 3e1b20c into datahub-project:master Mar 1, 2025
77 checks passed
@skrydal skrydal deleted the ps_improve_iceberg_connector branch March 1, 2025 00:30
Labels
ingestion PR or Issue related to the ingestion of metadata pending-submitter-merge