-
Notifications
You must be signed in to change notification settings - Fork 117
read_storage() - filter by name pattern #1285
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Reviewer's GuideAdds optional pattern-based file filtering to read_storage by introducing a helper that builds glob filters, updates the function API and documentation, and accompanies the feature with comprehensive unit tests. Class diagram for updated read_storage and _apply_pattern_filteringclassDiagram
class DataChain {
+filter(expr)
}
class C {
+glob(pattern)
}
class _apply_pattern_filtering {
+_apply_pattern_filtering(chain: DataChain, pattern: Union[str, list[str]], column: str) DataChain
}
class read_storage {
+read_storage(uri, ..., pattern=None, ...) DataChain
}
read_storage --> _apply_pattern_filtering : uses
_apply_pattern_filtering --> DataChain : filters
_apply_pattern_filtering --> C : builds filter expressions
Flow diagram for pattern-based filtering in read_storageflowchart TD
A[User calls read_storage with pattern argument] --> B[read_storage processes storage_chain]
B --> C[_apply_pattern_filtering applies glob filters]
C --> D[Filtered DataChain returned]
B -->|No pattern| D
File-Level Changes
Assessment against linked issues
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @dmpetrov - I've reviewed your changes - here's some feedback:
- Make sure you import typing.Union and the C helper for column construction in storage.py so that _apply_pattern_filtering compiles without errors.
- Consider applying the pattern filter to each individual chain before merging multiple URIs to reduce intermediate result sizes and improve efficiency.
- The glob logic always prepends a wildcard, which can lead to double-wildcard patterns when the user already includes '*'; consider normalizing or documenting this behavior to avoid confusion.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Make sure you import typing.Union and the C helper for column construction in storage.py so that _apply_pattern_filtering compiles without errors.
- Consider applying the pattern filter to each individual chain before merging multiple URIs to reduce intermediate result sizes and improve efficiency.
- The glob logic always prepends a wildcard, which can lead to double-wildcard patterns when the user already includes '*'; consider normalizing or documenting this behavior to avoid confusion.
## Individual Comments
### Comment 1
<location> `src/datachain/lib/dc/storage.py:43` </location>
<code_context>
+ pattern_list = pattern if isinstance(pattern, list) else [pattern]
+ filters = []
+
+ for pattern_item in pattern_list:
+ filters.append(C(f"{column}.path").glob(f"*{pattern_item}"))
+
+ combined_filter = filters[0]
</code_context>
<issue_to_address>
Pattern matching may be too permissive for certain use cases.
Consider whether using `*{pattern_item}` could unintentionally match more files than intended, especially if user-supplied patterns already include wildcards.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
for pattern_item in pattern_list:
filters.append(C(f"{column}.path").glob(f"*{pattern_item}"))
=======
for pattern_item in pattern_list:
# If the pattern already contains glob wildcards, use as-is
if any(char in pattern_item for char in ["*", "?", "["]):
glob_pattern = pattern_item
else:
glob_pattern = f"*{pattern_item}*"
filters.append(C(f"{column}.path").glob(glob_pattern))
>>>>>>> REPLACE
</suggested_fix>
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
for pattern_item in pattern_list: | ||
filters.append(C(f"{column}.path").glob(f"*{pattern_item}")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: Pattern matching may be too permissive for certain use cases.
Consider whether using *{pattern_item}
could unintentionally match more files than intended, especially if user-supplied patterns already include wildcards.
for pattern_item in pattern_list: | |
filters.append(C(f"{column}.path").glob(f"*{pattern_item}")) | |
for pattern_item in pattern_list: | |
# If the pattern already contains glob wildcards, use as-is | |
if any(char in pattern_item for char in ["*", "?", "["]): | |
glob_pattern = pattern_item | |
else: | |
glob_pattern = f"*{pattern_item}*" | |
filters.append(C(f"{column}.path").glob(glob_pattern)) |
filters = [] | ||
|
||
for pattern_item in pattern_list: | ||
filters.append(C(f"{column}.path").glob(f"*{pattern_item}")) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (code-quality): Convert for loop into list comprehension (list-comprehension
)
filters = [] | |
for pattern_item in pattern_list: | |
filters.append(C(f"{column}.path").glob(f"*{pattern_item}")) | |
filters = [ | |
C(f"{column}.path").glob(f"*{pattern_item}") | |
for pattern_item in pattern_list | |
] |
Deploying datachain-documentation with
|
Latest commit: |
5798e9a
|
Status: | ✅ Deploy successful! |
Preview URL: | https://1eea906f.datachain-documentation.pages.dev |
Branch Preview URL: | https://feature-pattern-based-filter.datachain-documentation.pages.dev |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1285 +/- ##
=======================================
Coverage 88.79% 88.80%
=======================================
Files 153 153
Lines 14055 14067 +12
Branches 1980 1983 +3
=======================================
+ Hits 12480 12492 +12
Misses 1113 1113
Partials 462 462
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
NOT READY YET because I don't like pattern=.. approach. Figuring our if the pattern could be fit into storage string.
Closes #1283 pattern only:
dc.read_storage("s3://bkt/dir", pattern="*.jpg")
ordc.read_storage("s3://bkt/dir", pattern=["*.jpg", "*.png"])
Summary by Sourcery
Add support for file name pattern filtering to read_storage
New Features:
Enhancements:
Documentation:
Tests: