S3 streaming #61

Open · wants to merge 4 commits into master

Conversation


@mrk-its mrk-its commented Nov 15, 2019

This allows streaming directly from/to S3 without creating temporary files. Additionally, files opened for reading are seekable (which, for example, makes it possible to read the contents of zip files directly from S3).
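The seekable-read behaviour can be sketched as a file-like wrapper that fetches byte ranges on demand. This is a minimal sketch, not the PR's actual code: the class name and the `fetch` callable are hypothetical, and against S3 the callable would issue a ranged GetObject request.

```python
import io


class RangedReader(io.RawIOBase):
    """Seekable, read-only stream over a remote object of known size.

    `fetch(start, end)` must return the bytes in the half-open range
    [start, end); against S3 it would wrap a ranged GetObject call.
    """

    def __init__(self, fetch, size):
        self._fetch = fetch
        self._size = size
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        elif whence == io.SEEK_END:
            self._pos = self._size + offset
        return self._pos

    def tell(self):
        return self._pos

    def read(self, size=-1):
        # Clamp the request to the end of the object.
        if size < 0 or self._pos + size > self._size:
            size = self._size - self._pos
        if size <= 0:
            return b""
        data = self._fetch(self._pos, self._pos + size)
        self._pos += len(data)
        return data


# In-memory demonstration; a real backend would fetch ranges from S3.
blob = b"0123456789"
reader = RangedReader(lambda s, e: blob[s:e], len(blob))
reader.seek(-3, io.SEEK_END)
print(reader.read())  # b'789'
```

Because only the requested range is transferred, seeking to the end of a large object and reading a little (as a zip reader does for the central directory) stays cheap.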

Drawbacks:

  • append mode is not supported (but it makes little sense on S3)

The code is heavily inspired by the streaming code from smart_open. It is also MIT-licensed; I'm crediting the main author in the file's preamble.

Fixes #47


mrk-its commented Nov 15, 2019

cc: @willmcgugan, @gilbsgilbs

willmcgugan (Member)

@mrk-its When you say append isn't supported, what happens if you were to open a file in append mode?


mrk-its commented Nov 17, 2019

> @mrk-its When you say append isn't supported, what happens if you were to open a file in append mode?

A ResourceError is raised:

        if _mode.appending:
            raise errors.ResourceError(path, msg="append mode is not supported")

willmcgugan (Member)

Sorry, that's not going to fly. It would make S3FS a special case and not a drop-in replacement for the other filesystems. If append can't be done efficiently, we would have to drop back to download/append/upload, which is at least no worse than it is now.


mrk-its commented Nov 17, 2019

> Sorry, that's not going to fly. It would make S3FS a special case and not a drop-in replacement for the other filesystems. If append can't be done efficiently, we would have to drop back to download/append/upload, which is at least no worse than it is now.

Ok, I'll implement it that way (I'm pretty sure it can be done by streaming the read, without downloading the whole file).


mrk-its commented Nov 17, 2019

@willmcgugan I've just implemented append mode (by copying the read stream into the output stream; it seems to work). I've also run the integration tests and unfortunately found a number of other problems that can't be easily fixed:

  • "rw" modes ("r+", "w+", "a+") do not work, and there is no way to make them work in the new implementation (we can of course fall back to the current implementation using temporary files, but that is far from ideal)
  • "w" mode is not seekable; this is inherent to the new implementation and there is no easy fix (we cannot even fall back to temporary files here)

If these small incompatibilities are not acceptable, please close this PR and I'll maintain my fork with streaming support (but I'm pretty sure that the ability to stream big input/output files is far more important than support for the rarely used "rw" modes or seeking in "w" mode).
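The append strategy described above, copying the existing object's read stream into the new output stream before writing the appended bytes, can be sketched generically. The helper below is hypothetical and uses plain callables rather than the PR's actual S3 streams.

```python
def append_via_copy(read_chunks, write, new_data):
    """Emulate append on a write-once store.

    `read_chunks()` yields the existing object's bytes in chunks (on
    S3: a streaming GetObject); `write(data)` feeds the replacement
    upload (on S3: multipart upload parts). The old content is
    streamed straight through, then the appended bytes are written,
    so the whole file is never held in memory or on local disk.
    """
    for chunk in read_chunks():
        write(chunk)
    write(new_data)


# In-memory demonstration of the stream-copy append.
old = b"hello "
parts = []
append_via_copy(lambda: iter([old]), parts.append, b"world")
print(b"".join(parts))  # b'hello world'
```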


davidparks21 commented Nov 17, 2019

You have my vote on streaming. We house very large files on S3, and a common use case is to read small pieces of them, like headers. I do this in boto3 when performance matters, but for small scripts it's annoying to write the code twice, once for the filesystem and once for S3; this is where PyFilesystem hits its stride. But that is only an option if I can read the header without pulling down a sizeable fraction of a terabyte of data.
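Reading just a header with boto3 comes down to an HTTP Range request, so only the requested bytes are transferred. A minimal sketch, with a hypothetical `byte_range` helper and placeholder bucket/key names:

```python
def byte_range(offset, length):
    """Build an HTTP Range header value for `length` bytes starting at
    `offset`. Note the Range end is inclusive."""
    return "bytes={}-{}".format(offset, offset + length - 1)


# With boto3 the ranged request would look like (placeholders):
#   import boto3
#   s3 = boto3.client("s3")
#   resp = s3.get_object(Bucket="my-bucket", Key="huge.bin",
#                        Range=byte_range(0, 1024))
#   header = resp["Body"].read()  # just the first 1024 bytes

print(byte_range(0, 1024))  # bytes=0-1023
```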

If this can't be made to work directly in PyFilesystem, @mrk-its you'll be my hero for maintaining the fork. :) :)

It seems reasonable to allow some limitations/incompatibilities when the user explicitly sets strict=False, and therefore acknowledges and expects the inherent trade-offs.


mrk-its commented Nov 17, 2019

I also really want to have streaming directly in PyFilesystem, but to get it we must accept some limitations of the underlying filesystems (for example, the lack of some opening modes). I think this will be a recurring problem for other filesystem implementations too: the ability to open a file in "rw" mode, or to seek in write mode, is a very strong requirement.


davidparks21 commented Nov 18, 2019

I agree; I had a similar problem with TarFS previously. When walking a .tar file, TarFS has to scan the archive twice in order to be strictly compliant, guarding against duplicate files (a legal situation in a tar file, but not in PyFilesystem). That double scan made it unusable on anything but trivially small files, and I had to back off from using PyFilesystem in that case.

Perhaps standardizing something like strict=False across all filesystems would be a good thing. That way, when compliance hurts performance for a given filesystem, the filesystem has the option to relax some requirements in exchange for the performance benefit, documented and agreed to by the user explicitly (by setting strict=False).

For example, if TarFS had a strict=False option, it could be documented that walking a tar file would then need only a single scan (a huge benefit on large files) but might yield the same filename twice. The same kind of issue seems to be cropping up here.
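The proposed strict switch could look like the following toy sketch. This is entirely hypothetical: PyFilesystem has no such standardized parameter today, and the class and mode-degradation policy are invented for illustration.

```python
class StreamingS3File:
    """Toy illustration of a filesystem-level `strict` switch."""

    # Modes a streaming backend could honour natively (hypothetical).
    SUPPORTED_MODES = {"r", "w", "a"}

    @classmethod
    def open(cls, path, mode="r", strict=True):
        if mode not in cls.SUPPORTED_MODES:
            if strict:
                # Strict: refuse modes the backend cannot honour.
                raise ValueError(
                    "mode %r is not supported by streaming S3" % mode
                )
            # Relaxed: degrade to the closest supported mode, trading
            # full compliance for the streaming performance benefit.
            mode = "w" if "w" in mode else "r"
        return cls(path, mode)

    def __init__(self, path, mode):
        self.path = path
        self.mode = mode


f = StreamingS3File.open("data.bin", "r+", strict=False)
print(f.mode)  # r
```

With strict=True (the default) the unsupported "r+" request raises, matching the PR's current ResourceError behaviour; with strict=False the caller has explicitly opted into the degraded mode.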

The performance issues in both cases, TarFS and S3FS, are holding me back from putting PyFilesystem to more than experimental use.

@CMCDragonkai

There are filesystems that will often fail on certain operations they don't support. One common example is ZFS.

So it should be fine to raise exceptions for things the underlying fs doesn't support.

For "rw" mode it seems you would have to fall back to a temporary file, or build a very complicated networked memory-mapping system using multipart uploads and ranged/seeked downloads.
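On the upload side, S3 multipart uploads require every part except the last to be at least 5 MiB, so a streaming writer typically buffers bytes until a part-sized chunk is ready. A minimal sketch with a pluggable `upload_part` callable (hypothetical names, not s3fs internals; against S3 the callable would wrap boto3's upload_part):

```python
class MultipartWriter:
    """Buffer writes and flush fixed-size parts to `upload_part(n, data)`.

    S3 requires every multipart part except the last to be at least
    5 MiB; `part_size` models that minimum.
    """

    def __init__(self, upload_part, part_size=5 * 1024 * 1024):
        self._upload_part = upload_part
        self._part_size = part_size
        self._buffer = bytearray()
        self._part_number = 0

    def write(self, data):
        self._buffer.extend(data)
        # Flush full-sized parts as soon as they are available.
        while len(self._buffer) >= self._part_size:
            chunk = bytes(self._buffer[: self._part_size])
            del self._buffer[: self._part_size]
            self._part_number += 1
            self._upload_part(self._part_number, chunk)

    def close(self):
        # The final part is allowed to be smaller than the minimum.
        if self._buffer:
            self._part_number += 1
            self._upload_part(self._part_number, bytes(self._buffer))
            self._buffer.clear()


# Demonstration with a tiny part size and an in-memory sink.
parts = []
w = MultipartWriter(lambda n, data: parts.append((n, data)), part_size=4)
w.write(b"abcdefghij")
w.close()
print(parts)  # [(1, b'abcd'), (2, b'efgh'), (3, b'ij')]
```

Because parts are flushed as they fill, memory use stays bounded by the part size regardless of the total object size; this is also why seeking backwards in "w" mode is hard, since already-flushed parts cannot be rewritten.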

Successfully merging this pull request may close these issues:

  • Use multipart upload instead of a temporary file

4 participants