Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ingest/s3): add table filtering #12661

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

eagle-25
Copy link
Contributor

@eagle-25 eagle-25 commented Feb 18, 2025

Description

  1. s3_source trying ingest all folders in path_spec.include and this noisily triggers the warning messages.
path_spec.include: s3://bucket/path/{table}/{partition0}

s3 files
1. s3://bucket/shop_log/year=2025/month=01/day=01/0000.csv  # error
2. s3://bucket/visit_log/year=2025/month=01/day=02/0000.csv  # error
3. s3://bucket/user_log/2025/0000.csv

==== start s3 ingestion

--- noisy logs---

INFO - Processing folder: datahub-test/shop_log
INFO - Getting files from folder: s3://bucket/datahub-test/shop_log/year=2025/
WARNING - Unable to find any files in the folder s3://bucket/datahub-test/shop_log/year=2025/. Skipping...

INFO - Processing folder: datahub-test/visit_log
INFO - Getting files from folder: s3://bucket/datahub-test/visit_log/year=2025/
WARNING - Unable to find any files in the folder s3://bucket/datahub-test/visit_log/year=2025/. Skipping...

--- meaningful log ---

INFO - Processing folder: datahub-test/user_log
INFO - Getting files from folder: s3://bucket/datahub-test/user_log/2025/
INFO - Extracting table schema from file: s3://bucket/datahub-test/user_log/2025/ai_kpi.04_kst_v4_dag_runs.csv
  1. It is not able to ingest some specific tables from a path. Though we have the ignore_patterns property, it is not a proper way because it will be wordy.

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added ingestion PR or Issue related to the ingestion of metadata community-contribution PR or Issue raised by member(s) of DataHub Community labels Feb 18, 2025
Copy link

codecov bot commented Feb 18, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Files with missing lines Coverage Δ
...hub/ingestion/source/data_lake_common/path_spec.py 85.03% <100.00%> (+0.75%) ⬆️
...ngestion/src/datahub/ingestion/source/s3/source.py 86.66% <100.00%> (+0.13%) ⬆️

... and 37 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f14c42d...22869f4. Read the comment docs.

@eagle-25 eagle-25 force-pushed the feat/s3/ingest/add-allowed-tables-prop branch 2 times, most recently from 5037d74 to d4637d6 Compare February 18, 2025 05:49
@eagle-25 eagle-25 force-pushed the feat/s3/ingest/add-allowed-tables-prop branch from d4637d6 to 9371c27 Compare February 20, 2025 15:35
@eagle-25 eagle-25 marked this pull request as ready for review February 20, 2025 15:35
@datahub-cyborg datahub-cyborg bot added the needs-review Label for PRs that need review from a maintainer. label Feb 20, 2025
@eagle-25 eagle-25 force-pushed the feat/s3/ingest/add-allowed-tables-prop branch from 9371c27 to 9f91d21 Compare February 24, 2025 15:05
@@ -145,6 +145,11 @@ class Config:
description="Include hidden folders in the traversal (folders starting with . or _",
)

allowed_tables: Optional[List[str]] = Field(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other sources usually handle this filtering with the allow/deny patterns. So this should be:

    tables_pattern: AllowDenyPattern = Field(

You can find many examples in the code https://github.com/search?q=repo%3Adatahub-project%2Fdatahub+AllowDenyPattern+language%3APython&type=code&l=Python

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated! Could you check again?

Copy link
Contributor

@sgomezvillamor sgomezvillamor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment on the new config parameter, so it is aligned with other sources

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Feb 25, 2025
@eagle-25 eagle-25 force-pushed the feat/s3/ingest/add-allowed-tables-prop branch from 9f91d21 to ed27f0b Compare February 25, 2025 14:57
@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Feb 25, 2025
@eagle-25 eagle-25 force-pushed the feat/s3/ingest/add-allowed-tables-prop branch 4 times, most recently from 52a1c37 to 10d7da5 Compare February 25, 2025 15:45
@eagle-25
Copy link
Contributor Author

eagle-25 commented Feb 25, 2025

Changes

  • remove allowed_tables and is_allowed_table() from PathSpec
  • add tables_pattern: AllowDenyPattern to PathSpec
  • use tables_pattern.allowed() to filter the tables
  • add allow_tables and deny_tables to s3 integration test
  • add unit tests for tables_pattern.allowed() to PathSpec().allowed() and PathSpec().allowed_dir()

- add tables_pattern to path_spec
- add table filtering to S3Source().s3_browser()
@eagle-25 eagle-25 force-pushed the feat/s3/ingest/add-allowed-tables-prop branch from e9ab564 to 024d22b Compare February 25, 2025 16:37
@eagle-25 eagle-25 force-pushed the feat/s3/ingest/add-allowed-tables-prop branch from 024d22b to 22869f4 Compare February 26, 2025 01:17
@eagle-25
Copy link
Contributor Author

eagle-25 commented Feb 26, 2025

The CI (Python Build / Python Wheels) seems broken, and merging this PR would fix it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community-contribution PR or Issue raised by member(s) of DataHub Community ingestion PR or Issue related to the ingestion of metadata needs-review Label for PRs that need review from a maintainer.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants