pyarrow compiled without S3 support #2759

skh1000 · 2024-04-05T13:10:35Z

Background: I'm processing sorted parquet files. I don't need or want to load the entire file; I would like to read forward in chunks to limit memory use, which is one of the benefits of working with sorted data. But I'm stymied in this, and wr.s3.read_parquet regularly overruns memory for my AWS Lambda.

Maybe there's a workaround (if so, can anyone share?), but I'm stymied by the fact that the pyarrow
library in aws-sdk-pandas is compiled without S3 support. I want to use pyarrow natively
to read files in a finer-grained fashion (e.g. one row group at a time) instead of the wrangler routine,
which seems to load an entire file at once (and apparently reading in numeric chunks isn't better on memory).
So, if newer releases could compile pyarrow with S3 support, and keep doing so going forward, that would be useful to people like me who need access to more primitive operations than loading an entire file.

Replies: 0 comments
