[fs] support hfs.ls on a bucket #14176
Conversation
Teaches `hfs.ls('gs://bucket/')` to list the files and directories at the top level of the bucket. In `main` that command raises because this line of `_ls_no_glob` raises:

```python3
maybe_sb_and_t, maybe_contents = await asyncio.gather(
    self._size_bytes_and_time_modified_or_none(path), ls_as_dir()
)
```

In particular, `statfile` raises a cloud-specific, esoteric error about a malformed URL or empty object names:

```python3
async def _size_bytes_and_time_modified_or_none(self, path: str) -> Optional[Tuple[int, float]]:
    try:
        # Hadoop semantics: creation time is used if the object has no notion of last modification time.
        file_status = await self.afs.statfile(path)
        return (await file_status.size(), file_status.time_modified().timestamp())
    except FileNotFoundError:
        return None
```

I decided to add a subclass of `FileNotFoundError` which is self-describing: `IsABucketError`. I changed most methods to raise that error when given a bucket URL. The two interesting cases:

1. `isdir`. This raises an error, but I could also see this returning `True`. A bucket is like a directory whose path/name is empty.
2. `isfile`. This returns `False`, but I could also see this raising an error. This just seems convenient: we know the bucket is not a file, so we should say so.

---

Apparently `hfs.ls` had no current tests because the globbing system doesn't work with Azure https:// URLs. I fixed it to use `AsyncFSURL.with_new_path_component`, which is resilient to Azure https weirdness. However, I had to change `with_new_path_component` to treat an empty path in a special way. I wanted this to hold:

```
actual = str(afs.parse_url('gs://bucket').with_new_path_component('bar'))
expected = 'gs://bucket/bar'
assert actual == expected
```

But `with_new_path_component` interacts badly with `GoogleAsyncFSURL.__str__` to return this:

```
'gs://bucket//bar'
```
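A minimal sketch of the two pieces described above. `IsABucketError` follows the description in the PR; `join_path_component` is a hypothetical stand-alone helper illustrating the empty-path special case (the real `with_new_path_component` lives on `AsyncFSURL` and may differ):

```python
class IsABucketError(FileNotFoundError):
    """Self-describing error for operations handed a bare bucket URL.

    Subclassing FileNotFoundError keeps existing `except FileNotFoundError`
    handlers working while making the failure mode explicit.
    """

    def __init__(self, url: str):
        super().__init__(f'{url} is a bucket, not a file or directory')


def join_path_component(path: str, component: str) -> str:
    # Treat an empty base path specially so that joining a component onto a
    # bare bucket yields 'bar', not '/bar' (which would stringify to
    # 'gs://bucket//bar').
    return component if path == '' else path.rstrip('/') + '/' + component
```

With this special case, `'gs://bucket' + '/' + join_path_component('', 'bar')` produces the desired single-slash `gs://bucket/bar`.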
```diff
@@ -490,6 +499,8 @@ async def create(self, url: str, *, retry_writes: bool = True) -> S3CreateManage
         # complete before the write can begin (unlike the current
         # code, that copies 128MB parts in 256KB chunks).
         bucket, name = self.get_bucket_and_name(url)
+        if name == '':
+            raise IsABucketError(url)
```
nit: I think I'd rather have this handled with a decorator (or at least a separate method). Is it bad DX to have these methods accept an `S3AsyncFSURL` and have a decorator on top that converts `str` to the URL object, optionally erroring if there's no path?
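An untyped sketch of that suggestion (names hypothetical; `urlparse` stands in for the real `S3AsyncFS` URL parsing):

```python
import functools
from urllib.parse import urlparse


class IsABucketError(FileNotFoundError):
    """Redefined here so the sketch is self-contained."""


def parse_and_check_url(method):
    """Convert the str URL argument to a parsed URL, erroring on a bare bucket."""

    @functools.wraps(method)
    async def wrapper(self, url: str, *args, **kwargs):
        parsed = urlparse(url)  # hypothetical stand-in for S3AsyncFSURL parsing
        if parsed.path in ('', '/'):
            raise IsABucketError(url)
        return await method(self, parsed, *args, **kwargs)

    return wrapper


class FS:
    @parse_and_check_url
    async def open(self, url):
        # Toy body: the decorated method receives the parsed URL.
        return url.path
```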
So, my main gripe with decorators is that they tend to mangle the types. I tried writing a decorator with precise types in this commit. I'm not sure why this fails:
```
$ pyright fs.py
/Users/dking/projects/hail/hail/python/hailtop/aiocloud/aioaws/fs.py
/Users/dking/projects/hail/hail/python/hailtop/aiocloud/aioaws/fs.py:460:15 - error: "open" overrides method of same name in class "AsyncFS" with incompatible type "StrOrURLMethod[(), Coroutine[Any, Any, ReadableStream]]" (reportIncompatibleMethodOverride)
1 error, 0 warnings, 0 informations
```
It works fine though. I can read stuff from S3.
```python3
In [1]: from hailtop.aiocloud.aioaws import S3AsyncFS
   ...: fs = S3AsyncFS()
   ...: (await (await fs.open('s3://gnomad-public-us-east-1/release/4.0/constraint/README.txt')).read())[:100]
Out[1]: b'\xef\xbb\xbfConstraint field descriptions\r\n\r\nDescriptions of columns in the gnomAD v4 constraint metrics tsv.'
```
Maybe I'm overcomplicating what you're asking, but there seem to be two issues:
- If the URL is in the `open` method's signature, either we make `FS` generic (ugh) or we assert that the URL is the right URL everywhere.
- Preserving precise type information requires a fair bit of code (even if we can resolve (1)).
I was able to write a decorator that had no types, and pyright didn't complain, but I also lost all type hinting on `fs.open`. IMO, between types and not repeating myself, I'd go with types.
Let me try something using inheritance instead.
Something along these lines may work and won't be too bad, but we'll need to play the same trick where the RouterAsyncFSURL has a reference to its file system. It will be a rather large change and I (1) would rather not tie that to this PR and (2) have committed to limiting my programming substantially in this last month. What do you think of leaving this as a follow up cleanup?
hrmph, ya while I don't hate the last option in principle I'd rather not get that involved here. Can you just wrap the error throwing in a method so that most of these call-sites become one-liners and we can punt on the refactor?
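Something like this hypothetical helper, so each call-site becomes a one-liner (`get_bucket_and_name` is simplified here for self-containment):

```python
class IsABucketError(FileNotFoundError):
    """Redefined here so the sketch is self-contained."""


class S3AsyncFS:
    @staticmethod
    def get_bucket_and_name(url: str):
        # Simplified stand-in for the real S3 URL parser.
        bucket, _, name = url.removeprefix('s3://').partition('/')
        return bucket, name

    def _raise_if_bucket(self, url: str) -> str:
        """Guard for methods that need an object name, not a bare bucket."""
        _, name = self.get_bucket_and_name(url)
        if name == '':
            raise IsABucketError(url)
        return name
```

Each guarded method would then start with `name = self._raise_if_bucket(url)` instead of repeating the two-line check.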
@daniel-goldstein OK, this was quite a few lines of code change, so worth a careful re-review.
```python3
return

bucket_url = str(base.with_path(''))
fs = RouterFS()
```
this resource needs to be closed
I made `RouterFS` a context manager and used that here. I also added an exit stack to `LocalBackend`, and I changed `ServiceBackend` to use its exit stack the same way as `LocalBackend`.
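The context-manager change might look roughly like this (a sketch under assumed names, not the actual `RouterFS` code):

```python
from contextlib import ExitStack


class RouterFS:
    def __init__(self):
        # The stack owns any sub-resources; closing it closes them all
        # in reverse order of registration.
        self._exit_stack = ExitStack()

    def __enter__(self) -> 'RouterFS':
        return self

    def __exit__(self, exc_type, exc_val, exc_tb) -> None:
        self.close()

    def close(self) -> None:
        self._exit_stack.close()
```

Call-sites can then write `with RouterFS() as fs: ...` and the resource is closed even if the body raises.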
played some type golf trying to figure out what would work. see comments, I think there's a workable solution but it's a large change to the FS stuff
addressed comment in thread. Let's just use a subroutine for now