Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(fix): ensure zip directory store compares key to prefix correctly #2758

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion src/zarr/storage/_zip.py
Original file line number Diff line number Diff line change
Expand Up @@ -271,7 +271,7 @@ async def list_dir(self, prefix: str) -> AsyncIterator[str]:
yield key
else:
for key in keys:
if key.startswith(prefix + "/") and key != prefix:
if key.startswith(prefix + "/") and key.strip("/") != prefix:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a TON of this kind of thing already in this codebase. We need these things defined in the IO layer generally, not in individual implementations, as they all face similar problems.

Similarly, do we actually need a ZIP implementation when fsspec can do this for you?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there have been a LOT of these string parsing bugs. it's a weak point in the codebase, and we definitely need something more solid!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fsspec knows how you feel.

We can come up with a small number of string normalising functions that live on the Store baseclass, something like os.path, and normalise all user-passed strings at the first opportunity. This is what fsspec's _strip_protocol attempts, but also has problems.

def norm_path(s):
    # this probably adds not insignificant runtime cost
    return re.sub("/+", "/", s.lstrip("/").rstrip("/"))

Copy link
Contributor Author

@ilan-gold ilan-gold Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a weak point in the codebase

I did wonder how widespread this could be. I noticed this sort of thing in other stores but wasn't sure if there was knowledge that they were kosher (for whatever reason) within the maintainers (and therefore didn't need this sort of fix).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The integration with backend IO libraries (fsspec and probably objstore) is particularly thorny, since they do their own path munging!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we didn't represent prefixes as strings, but as tuples of strings instead? e.g. instead of /foo/bar/baz we would just have ('', 'foo', 'bar','baz'), which would be very easy to distinguish from ('foo', 'bar', 'baz'). Lets ignore for a second the profound disruption this would have on our store APIs.

k = key.removeprefix(prefix + "/").split("/")[0]
if k not in seen:
seen.add(k)
Expand Down
Loading