S3 streaming #61

Open · wants to merge 4 commits into master

Conversation


@mrk-its mrk-its commented Nov 15, 2019

This allows streaming directly from/to S3 without creating temporary files. Additionally, files opened for reading are seekable (which, for example, makes it possible to read the contents of zip files directly from S3).
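The seekable-read behaviour can be sketched as a file-like wrapper that fetches byte ranges on demand. This is a minimal sketch, not the PR's actual code: the class name and the `fetch` callable are hypothetical, and against S3 the callable would issue a ranged GetObject request.

```python
import io


class RangedReader(io.RawIOBase):
    """Seekable, read-only stream over a remote object of known size.

    `fetch(start, end)` must return the bytes in the half-open range
    [start, end); against S3 it would wrap a ranged GetObject call.
    """

    def __init__(self, fetch, size):
        self._fetch = fetch
        self._size = size
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        elif whence == io.SEEK_END:
            self._pos = self._size + offset
        return self._pos

    def tell(self):
        return self._pos

    def read(self, size=-1):
        # Clamp the request to the end of the object.
        if size < 0 or self._pos + size > self._size:
            size = self._size - self._pos
        if size <= 0:
            return b""
        data = self._fetch(self._pos, self._pos + size)
        self._pos += len(data)
        return data


# In-memory demonstration; a real backend would fetch ranges from S3.
blob = b"0123456789"
reader = RangedReader(lambda s, e: blob[s:e], len(blob))
reader.seek(-3, io.SEEK_END)
print(reader.read())  # b'789'
```

Because only the requested range is transferred, seeking to the end of a large object and reading a little (as a zip reader does for the central directory) stays cheap.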

Drawbacks:

  • append mode is not supported (but it makes little sense on S3)

The code is heavily inspired by the streaming code from smart_open. It is also MIT-licensed; I'm crediting the main author in the file's preamble.

Fixes #47


mrk-its commented Nov 15, 2019

cc: @willmcgugan, @gilbsgilbs

willmcgugan (Member)

@mrk-its When you say append isn't supported, what happens if you were to open a file in append mode?


mrk-its commented Nov 17, 2019

> @mrk-its When you say append isn't supported, what happens if you were to open a file in append mode?

A ResourceError is raised:

        if _mode.appending:
            raise errors.ResourceError(path, msg="append mode is not supported")

willmcgugan (Member)

Sorry, that's not going to fly. It would make S3FS a special case and not a drop-in replacement for the other filesystems. If append can't be done efficiently, we would have to drop back to download/append/upload, which is at least no worse than it is now.


mrk-its commented Nov 17, 2019

> Sorry, that's not going to fly. It would make S3FS a special case and not a drop-in replacement for the other filesystems. If append can't be done efficiently, we would have to drop back to download/append/upload, which is at least no worse than it is now.

Ok, I'll implement it that way (I'm pretty sure it can be done by streaming the read, without downloading the whole file).


mrk-its commented Nov 17, 2019

@willmcgugan I've just implemented append mode (by copying the read stream into the output stream; it seems to work). I've also run the integration tests and unfortunately found a number of other problems that can't be easily fixed:

  • "rw" modes ("r+", "w+", "a+") do not work, and there is no way to make them work in the new implementation (we can of course fall back to the current implementation using temporary files, but that is far from ideal)
  • "w" mode is not seekable; this is inherent to the new implementation and there is no easy fix (we cannot even fall back to temporary files here)

If these small incompatibilities are not acceptable, please close this PR and I'll maintain my fork with streaming support (but I'm pretty sure that the ability to stream big input/output files is far more important than support for the rarely used "rw" modes or seeking in "w" mode).
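The append strategy described above, copying the existing object's read stream into the new output stream before writing the appended bytes, can be sketched generically. The helper below is hypothetical and uses plain callables rather than the PR's actual S3 streams.

```python
def append_via_copy(read_chunks, write, new_data):
    """Emulate append on a write-once store.

    `read_chunks()` yields the existing object's bytes in chunks (on
    S3: a streaming GetObject); `write(data)` feeds the replacement
    upload (on S3: multipart upload parts). The old content is
    streamed straight through, then the appended bytes are written,
    so the whole file is never held in memory or on local disk.
    """
    for chunk in read_chunks():
        write(chunk)
    write(new_data)


# In-memory demonstration of the stream-copy append.
old = b"hello "
parts = []
append_via_copy(lambda: iter([old]), parts.append, b"world")
print(b"".join(parts))  # b'hello world'
```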


davidparks21 commented Nov 17, 2019

You have my vote on streaming. We house very large files on S3, and a common use case is to read small pieces of them, like headers. I do this in boto3 when performance matters, but for small scripts it's annoying to write the code twice, once for the filesystem and once for S3; this is where PyFilesystem hits its stride. But that is only an option if I can read the header without pulling down a sizeable fraction of a terabyte of data.
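Reading just a header with boto3 comes down to an HTTP Range request, so only the requested bytes are transferred. A minimal sketch, with a hypothetical `byte_range` helper and placeholder bucket/key names:

```python
def byte_range(offset, length):
    """Build an HTTP Range header value for `length` bytes starting at
    `offset`. Note the Range end is inclusive."""
    return "bytes={}-{}".format(offset, offset + length - 1)


# With boto3 the ranged request would look like (placeholders):
#   import boto3
#   s3 = boto3.client("s3")
#   resp = s3.get_object(Bucket="my-bucket", Key="huge.bin",
#                        Range=byte_range(0, 1024))
#   header = resp["Body"].read()  # just the first 1024 bytes

print(byte_range(0, 1024))  # bytes=0-1023
```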

If this can't be made to work directly in PyFilesystem, @mrk-its you'll be my hero for maintaining the fork. :) :)

It seems reasonable to allow some limitations/incompatibilities when the user explicitly sets strict=False, and therefore acknowledges and expects the inherent trade-offs.


mrk-its commented Nov 17, 2019

I also really want to have streaming directly in PyFilesystem, but to get it we must accept some limitations of the underlying filesystems (for example, the lack of some opening modes). I think this will be a recurring problem for other filesystem implementations too: the ability to open a file in "rw" mode, or to seek in write mode, is a very strong requirement.


davidparks21 commented Nov 18, 2019

I agree; I had a similar problem with TarFS previously. When walking a .tar file, TarFS has to scan the archive twice in order to be strictly compliant, guarding against duplicate files (a legal situation in a tar file, but not in PyFilesystem). That double scan made it unusable on anything but trivially small files, and I had to back off from using PyFilesystem in that case.

Perhaps standardizing something like strict=False across all filesystems would be a good thing. That way, when compliance hurts performance for a given filesystem, the filesystem has the option to relax some requirements in exchange for the performance benefit, documented and agreed to by the user explicitly (by setting strict=False).

For example, if TarFS had a strict=False option, it could be documented that walking a tar file would then need only a single scan (a huge benefit on large files) but might yield the same filename twice. The same kind of issue seems to be cropping up here.
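The proposed strict switch could look like the following toy sketch. This is entirely hypothetical: PyFilesystem has no such standardized parameter today, and the class and mode-degradation policy are invented for illustration.

```python
class StreamingS3File:
    """Toy illustration of a filesystem-level `strict` switch."""

    # Modes a streaming backend could honour natively (hypothetical).
    SUPPORTED_MODES = {"r", "w", "a"}

    @classmethod
    def open(cls, path, mode="r", strict=True):
        if mode not in cls.SUPPORTED_MODES:
            if strict:
                # Strict: refuse modes the backend cannot honour.
                raise ValueError(
                    "mode %r is not supported by streaming S3" % mode
                )
            # Relaxed: degrade to the closest supported mode, trading
            # full compliance for the streaming performance benefit.
            mode = "w" if "w" in mode else "r"
        return cls(path, mode)

    def __init__(self, path, mode):
        self.path = path
        self.mode = mode


f = StreamingS3File.open("data.bin", "r+", strict=False)
print(f.mode)  # r
```

With strict=True (the default) the unsupported "r+" request raises, matching the PR's current ResourceError behaviour; with strict=False the caller has explicitly opted into the degraded mode.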

The performance issues in both cases, TarFS and S3FS, are holding me back from putting PyFilesystem to more than experimental use.

@CMCDragonkai

There are filesystems that will often fail on certain operations they don't support. One common example is ZFS.

So it should be fine to raise exceptions for things the underlying fs doesn't support.

For "rw" mode it seems you would have to fall back to a temporary file, or build a very complicated networked memory-mapping system using multipart uploads and ranged/seeked downloads.
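On the upload side, S3 multipart uploads require every part except the last to be at least 5 MiB, so a streaming writer typically buffers bytes until a part-sized chunk is ready. A minimal sketch with a pluggable `upload_part` callable (hypothetical names, not s3fs internals; against S3 the callable would wrap boto3's upload_part):

```python
class MultipartWriter:
    """Buffer writes and flush fixed-size parts to `upload_part(n, data)`.

    S3 requires every multipart part except the last to be at least
    5 MiB; `part_size` models that minimum.
    """

    def __init__(self, upload_part, part_size=5 * 1024 * 1024):
        self._upload_part = upload_part
        self._part_size = part_size
        self._buffer = bytearray()
        self._part_number = 0

    def write(self, data):
        self._buffer.extend(data)
        # Flush full-sized parts as soon as they are available.
        while len(self._buffer) >= self._part_size:
            chunk = bytes(self._buffer[: self._part_size])
            del self._buffer[: self._part_size]
            self._part_number += 1
            self._upload_part(self._part_number, chunk)

    def close(self):
        # The final part is allowed to be smaller than the minimum.
        if self._buffer:
            self._part_number += 1
            self._upload_part(self._part_number, bytes(self._buffer))
            self._buffer.clear()


# Demonstration with a tiny part size and an in-memory sink.
parts = []
w = MultipartWriter(lambda n, data: parts.append((n, data)), part_size=4)
w.write(b"abcdefghij")
w.close()
print(parts)  # [(1, b'abcd'), (2, b'efgh'), (3, b'ij')]
```

Because parts are flushed as they fill, memory use stays bounded by the part size regardless of the total object size; this is also why seeking backwards in "w" mode is hard, since already-flushed parts cannot be rewritten.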

Successfully merging this pull request may close these issues:

  • Use multipart upload instead of a temporary file

4 participants