Background: I'm processing sorted parquet files. I don't need or want to load an entire file; I'd like to read forward in chunks to limit memory use, which is one of the benefits of working with sorted data. Right now I can't do that, and wr.s3.read_parquet regularly exceeds the memory limit of my AWS Lambda.
Maybe there's a workaround (if so, can anyone share it?), but I'm blocked by the fact that the pyarrow library bundled with aws-sdk-pandas is compiled without S3 support. I want to use pyarrow natively to read files in a finer-grained fashion (e.g. one row group at a time) instead of the wrangler routine, which seems to load an entire file at once (and reading in numeric chunks apparently doesn't reduce memory use).
So if newer releases could compile pyarrow with S3 support, and keep doing so going forward, that would be useful to people like me who need lower-level operations than loading an entire file.
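For concreteness, here is a rough sketch of the access pattern I'm after: reading one row group at a time and stopping early. It assumes s3fs is installed and is used to open the S3 object as a file-like handle, since the bundled pyarrow has no S3 filesystem of its own. The helper names and the bucket path are hypothetical, and I haven't verified this inside a Lambda layer.

```python
from typing import Any, Iterator, Optional, Sequence


def iter_row_groups(
    parquet_file: Any, columns: Optional[Sequence[str]] = None
) -> Iterator[Any]:
    """Yield one row group at a time, in file order.

    `parquet_file` is expected to behave like pyarrow.parquet.ParquetFile,
    i.e. expose `num_row_groups` and `read_row_group(i, columns=...)`.
    """
    for index in range(parquet_file.num_row_groups):
        # Only the current row group is read; the caller can stop early.
        yield parquet_file.read_row_group(index, columns=columns)


def open_parquet_on_s3(path: str) -> Any:
    """Open an S3 object as a ParquetFile via s3fs (assumption: s3fs is available)."""
    # Local imports so iter_row_groups can be used without these packages.
    import pyarrow.parquet as pq
    import s3fs

    filesystem = s3fs.S3FileSystem()
    return pq.ParquetFile(filesystem.open(path, "rb"))


# Hypothetical usage (bucket, key, and stop condition are placeholders):
# parquet_file = open_parquet_on_s3("s3://my-bucket/sorted.parquet")
# for table in iter_row_groups(parquet_file):
#     frame = table.to_pandas()
#     if reached_end_of_range(frame):  # caller-defined check on the sort key
#         break
```

Because the data is sorted, the loop can break as soon as the sort key passes the range of interest, so later row groups are never fetched.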