(fix): ensure zip directory store compares key to prefix correctly #2758
@@ -271,7 +271,7 @@ async def list_dir(self, prefix: str) -> AsyncIterator[str]:
                     yield key
         else:
             for key in keys:
-                if key.startswith(prefix + "/") and key != prefix:
+                if key.startswith(prefix + "/") and key.strip("/") != prefix:
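To make the one-line change concrete, here is a minimal standalone sketch of how an explicit directory entry such as `foo/` can slip past the old comparison. The `children` function and the key names are hypothetical, and the handling after the `if` (stripping the prefix and taking the first path segment) is an assumption about the surrounding code, not a copy of it.

```python
def children(keys, prefix, *, strip_key):
    """Hypothetical stand-in for the loop in the diff; not the library code."""
    seen = []
    for key in keys:
        # strip_key=False mimics the old check, strip_key=True the new one.
        candidate = key.strip("/") if strip_key else key
        if key.startswith(prefix + "/") and candidate != prefix:
            # Assumed follow-up step: keep only the first segment below prefix.
            child = key.removeprefix(prefix + "/").split("/")[0]
            if child not in seen:
                seen.append(child)
    return seen


# Illustrative key listing, as a zip archive with directory entries might give.
keys = ["foo/", "foo/bar/", "foo/bar/zarr.json", "foo/faz"]

old = children(keys, "foo", strip_key=False)
new = children(keys, "foo", strip_key=True)
print(old)
print(new)
```

Under these assumptions the old check lets the `foo/` entry through and produces an empty child name, while the stripped comparison skips it.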
There is a TON of this kind of thing already in this codebase. We need these things defined in the IO layer generally, not in individual implementations, as they all face similar problems.
Similarly, do we actually need a ZIP implementation when fsspec can do this for you?
there have been a LOT of these string parsing bugs. it's a weak point in the codebase, and we definitely need something more solid!
fsspec knows how you feel.
We can come up with a small number of string normalising functions that live on the Store baseclass, something like os.path, and normalise all user-passed strings at the first opportunity. This is what fsspec's _strip_protocol attempts, but also has problems.
import re

def norm_path(s):
    # this probably adds not insignificant runtime cost
    # drop leading/trailing slashes, then collapse repeated slashes
    return re.sub("/+", "/", s.strip("/"))
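A rough sketch of what "normalise at the first opportunity" could look like in practice. `SketchStore` is a hypothetical toy class, not the zarr `Store` base class, and its `list_dir` is only an illustration of the idea: once both the stored keys and the user-passed prefix go through the same helper, the comparison no longer has to care about stray slashes.

```python
import re


def norm_path(s: str) -> str:
    # Same idea as the helper in the comment above.
    return re.sub("/+", "/", s.strip("/"))


class SketchStore:
    """Hypothetical toy store used only to illustrate the idea."""

    def __init__(self, keys):
        # Normalise every key once, on the way in; drop the empty root key.
        self._keys = {norm_path(k) for k in keys} - {""}

    def list_dir(self, prefix: str) -> list[str]:
        prefix = norm_path(prefix)  # normalise user input immediately
        head = prefix + "/" if prefix else ""
        return sorted(
            {
                k[len(head):].split("/")[0]
                for k in self._keys
                if k.startswith(head) and k != prefix
            }
        )


store = SketchStore(["/foo/", "foo//bar/", "foo/bar/zarr.json", "foo/faz"])
print(store.list_dir("/foo/"))
print(store.list_dir(""))
```

The runtime cost mentioned in the comment still applies, since every key and every prefix passes through a regex substitution.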
> it's a weak point in the codebase
I did wonder how widespread this could be. I noticed this sort of thing in other stores, but wasn't sure whether the maintainers already knew those cases to be kosher for whatever reason (and therefore not in need of this sort of fix).
The integration with backend IO libraries (fsspec and probably objstore) is particularly thorny, since they do their own path munging!
What if we didn't represent prefixes as strings, but as tuples of strings instead? e.g. instead of /foo/bar/baz we would just have ('', 'foo', 'bar', 'baz'), which would be very easy to distinguish from ('foo', 'bar', 'baz'). Let's ignore for a second the profound disruption this would have on our store APIs.
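A small sketch of the tuple idea, assuming a plain split on "/" as the conversion. The helper names `to_parts` and `is_under` are hypothetical and not part of any existing store API. It shows the leading-slash case from the comment, plus one side benefit: comparing whole segments avoids treating `foobar` as living under `foo`.

```python
def to_parts(path: str) -> tuple[str, ...]:
    """Hypothetical helper: split a path string into its segments."""
    return tuple(path.split("/"))


def is_under(key: tuple[str, ...], prefix: tuple[str, ...]) -> bool:
    """True if key is strictly below prefix, compared segment by segment."""
    return len(key) > len(prefix) and key[: len(prefix)] == prefix


absolute = to_parts("/foo/bar/baz")   # leading slash gives an empty first segment
relative = to_parts("foo/bar/baz")
print(absolute)
print(relative)

print(is_under(to_parts("foo/bar"), to_parts("foo")))
print(is_under(to_parts("foobar/x"), to_parts("foo")))
print("foobar/x".startswith("foo"))   # the string check is fooled here
```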
can you add a test that would have failed on
@d-v-b if you're ok with me uploading a fixture that has no provenance, then yes (or whose provenance is "outside" zarr). Otherwise, I haven't figured out how to get the leading slash using pure-zarr APIs
you don't need to upload anything -- modify one of these test functions to generate a condition that triggers the error.
(you can, of course, use builtin zipfile to generate a possibly problematic object - that can be done in memory or tmpdir)
in the interest of speed, you could follow @martindurant's advice and add a test case that creates a problematic zip instance to the zip-store tests: https://github.com/zarr-developers/zarr-python/blob/main/tests/test_store/test_zip.py, we could then figure out how to generalize this to the base class if we feel the need.
That would suffice for me, just wasn't sure if this was kosher
Done!
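For reference, a minimal sketch of the builtin-zipfile suggestion made earlier in this thread: building an archive in memory that contains explicit directory entries. The entry names are illustrative only, and this is not the test that was actually added to the PR.

```python
import io
import zipfile

# Build the archive in memory; nothing is written to disk.
buf = io.BytesIO()
with zipfile.ZipFile(buf, mode="w") as zf:
    zf.writestr("foo/", b"")                  # explicit directory entry
    zf.writestr("foo/bar/", b"")              # nested directory entry
    zf.writestr("foo/bar/zarr.json", b"{}")   # illustrative file names
    zf.writestr("foo/faz", b"data")

# Re-open for reading and inspect what a store would see as keys.
with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()
    dir_entries = [info.filename for info in zf.infolist() if info.is_dir()]

print(names)
print(dir_entries)
```

An archive like this could then be handed to the zip store in a test to check that directory entries do not show up as spurious children.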
Fixes #2757
Since there was code there previously, I presume it was somehow possible to hit the condition although I haven't figured out how (outside of the reproducer file that I have).
Here is how I would in theory go about it:
But it only yields the subkey [faz] instead of ['bar', 'faz'] as in the linked issue/file.
UPDATE: I think the file was created by simply zipping an old zarr store, but I can't be certain. In any case I checked using unzip -v /Users/ilangold/Projects/Theis/anndata/tests/data/archives/v0.7.0/adata.zarr.zip and saw that folders are indeed listed there, so I do think this is a real possibility.
TODO:
docs/user-guide/*.rst
changes/