-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
S3 streaming #61
base: master
Are you sure you want to change the base?
S3 streaming #61
Conversation
cc: @willmcgugan, @gilbsgilbs |
@mrk-its When you say append isn't supported, what happens if you were to open a file in append mode? |
|
Sorry, that's not going to fly. It would make S3FS a special-case and not a drop in replacement for the other filesystems. If append can't be done efficiently we would have to drop back to the download/append/upload which is at least no-worse that it is now. |
Ok, I'll implement it that way (I'm pretty sure it may be done with streaming of reading, without downloading whole file) |
4dce8f8
to
695b879
Compare
efe11ee
to
2c919fe
Compare
@willmcgugan
If these small incompatibilities are not acceptable please close this PR, I'll maintain my fork with streaming support (but I'm pretty sure that possibility to streaming big input / output files is a way more important than support for rarely used "rw" modes or "w" mode seeking) |
You have my vote on streaming. We house very large files on S3 and a common use case is to read small bits like headers from them. I do this in boto3 when performance matters, but for small scripts it's annoying to double code for filesystem and S3, here PyFilesystem hits its stride. But this is only an option if I can read the header without pulling down a reasonable fraction of a terabyte of data. If this can't be made to work directly in PyFilesystem, @mrk-its you'll be my hero for maintaining the fork. :) :) It seems reasonable to allow some limitations/incompatibilities when the user explicitly sets |
I also really want to have streaming directly in PyFilesystem, but to have it we must accept some limitations of underlying filesystems (for example lack of some opening modes or so). I think it will be recurring problem for other filesystem implementations too - ability for opening file in "rw" mode or seeking in write mode is very strong requirement |
I agree, I had a similar problem with TarFS previously. When walking a .tar file, in order to be strictly compliant, TarFS has to double scan the archive to guard against duplicate files (a legal situation in a tar file, but not in PyFilesystem). But this double scan performance hit made it unusable on anything but trivially small files. I had to back off of using PyFilesystem in that case. Perhaps standardizing something like For example, if TarFS had a The performance issues in both cases, TarFS and S3FS are holding me back from putting PyFilesystem to more than just experimental use. |
There are filesystems that will often fail on certain operations that it doesn't support. One common example is ZFS. So it should be fine to raise exceptions for things the underlying fs doesn't support. Mode rw seems like you would have to fall back onto a temporary file. Or a very complicated networked memory mapping system using multi-part uploads and ranged/seeked downloads. |
It allows for streaming directly from / to S3 without creating temporary files. Additionally files opened for reading are seekable (for example it allows for reading contents of zip files directly from S3).
Drawbacks:
The code is heavy inspired by streaming code from smart_open. It is also on MIT license, I'm including main author in file's preamble.
Fixes #47