Python/S3 connection pooling #14572
I've managed to run some benchmarks now. Here's what I did: I have a load of parquet files on S3 that range from ~2kb to ~2mb depending on the size of the shard. I made a list of 100 random files (the same each time, so there'll be some optimisations on S3 between runs, I imagine). These files are listed in … This was all run from my computer in the same region as the S3 bucket. I imagine all speeds would be faster on AWS (Lambda or EC2), but I'm also assuming that the relative speeds would be the same.
With method 2, I'd assumed Polars might download less data, but it looks like it's doing more or less the same as downloading all the data. Maybe I've misunderstood how this works, or I've made some other mistake. Either way, I think this shows the performance improvements that could be possible for my use case with a connection pool.
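*(Editor's note: the benchmark scripts that "method 1"/"method 2" refer to aren't shown in the thread. The sketch below is a reconstruction, assuming method 1 is a single pooled boto3 client feeding Polars an in-memory buffer and method 2 is Polars reading `s3://` paths directly — consistent with the workaround described later in the thread. Bucket and key names are illustrative.)*

```python
import io
import time

import boto3
import polars as pl

BUCKET = "my-bucket"                                      # hypothetical
KEYS = [f"prefix/shard_{i}.parquet" for i in range(100)]  # illustrative names

# Method 1: one global boto3 client (pooled HTTPS connection);
# Polars reads each file from an in-memory buffer.
s3 = boto3.client("s3")

start = time.perf_counter()
for key in KEYS:
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    df = pl.read_parquet(io.BytesIO(body))
print(f"boto3 + BytesIO: {time.perf_counter() - start:.2f}s")

# Method 2: Polars fetches directly from S3, setting up its own
# connection(s) on each call.
start = time.perf_counter()
for key in KEYS:
    df = pl.read_parquet(f"s3://{BUCKET}/{key}")
print(f"pl.read_parquet(s3://...): {time.perf_counter() - start:.2f}s")
```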
It might be related to this issue #14384 and this comment: #14384 (comment)
Thanks for the speedy response! I've built the version at 0c86901 and run my tests again. This time I used an EC2 instance (to speed up build times on my slow laptop). The good news is that things are faster. Before the change … However, Boto3 with a global client is still outperforming Polars by a long way. The baseline for reading from boto and then passing this file on to Polars is about 2.5 seconds. I don't know if this is because boto3 is doing something that … I think this issue might need re-opening if the latter is true and this is something in the control of Polars. If it's not, then I expect the existing change is about all this project can do. I'll leave that up to someone who knows more about it than I do! Thanks again for the quick fix!
Could you try getting Polars out of the equation and run the benchmark directly with object_store? Curious what the baseline differences are.
I don't know Rust very well, but I took some time hacking something together. A release version of a simple script that loops over the exact same files as my Python script actually completes slightly faster than Boto3.
As I say, the new Polars version is much better than before the change, but there's something going on that's not related to object_store, I believe.

Script outline:

```rust
use std::time::Instant;

use object_store::{aws::AmazonS3Builder, path::Path, ObjectStore};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let s3_store = AmazonS3Builder::from_env()
        .with_bucket_name(bucket_name)
        .build()?;

    let files_to_download = vec![...]; // the same 100 files as the Python script

    let start = Instant::now(); // Start timing
    for file_path in files_to_download {
        let object_key = format!("prefix/{}", file_path);
        let object_path = Path::from(object_key);
        // One GET per file; object_store reuses its internal HTTP client.
        let data = s3_store.get(&object_path).await?.bytes().await?;
        println!("Downloaded {} bytes for {}", data.len(), file_path);
    }
    let duration = start.elapsed(); // End timing
    println!("Total: {} milliseconds", duration.as_millis());
    Ok(())
}
```

Of course in my Python scripts I'm also reading the parquet into Polars, but I doubt that's adding a lot of overhead.
Alright, interesting. Have you got some info on the Polars query you run? Polars downloads all row groups separately (concurrently), and in the case of projections it downloads the columns concurrently. How many row groups have you got per file? If you create files with a single row group, does that improve? (I am on vacation so have limited internet and cannot do cloud tests.)
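*(Editor's note: a minimal sketch of producing a single-row-group file for that test in py-polars; the file names are hypothetical.)*

```python
import polars as pl

# Hypothetical local copy of one shard; the real files live on S3.
df = pl.read_parquet("shard.parquet")

# A row_group_size larger than the row count (the thread mentions
# ~80k rows max) should yield one row group in the output file.
df.write_parquet("shard_single_rg.parquet", row_group_size=100_000)
```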
For this test I'm literally just calling … I can have a go at making smaller files. The largest file contains about 80k rows, most are about 25k rows. Sorry, I don't know enough about the workings of Parquet... is a row group the same as a row?! There are 8 columns in each file, with a fair amount of duplication of column values across rows.
You can check this function to read the number of row groups. It is a sharding mechanism for the Parquet format, designed so that Parquet readers can batch their data ingestion by this unit.
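*(Editor's note: the function linked above isn't visible in this scrape; from Python, pyarrow offers an equivalent check — a sketch with a hypothetical local path.)*

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("shard.parquet")  # hypothetical local path
print(pf.metadata.num_row_groups)     # e.g. 1
```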
Ok, thanks. I'm at the limits of my current understanding of Parquet, but I think a row group might relate to the number of actual files data is split across, e.g. when using hive partitions? Either way, I sampled a load of my files, including the ones in the tests, and they are all 1 row group. I've manually partitioned my data according to the domain I'm working in, so I know that when I'm using … I did try using hive partitions (before the patch related to this issue was in), but the processing time was very slow, I assume partly because of the (lack of a) connection pool. I've not tried with the patch, but as a single file downloads quickly enough using boto3 for my needs, I don't think changing the sharding approach is related to the issue here.
I appreciate this issue is closed, but for anyone else with this problem who might find this issue in future: as things currently stand it's about twice as fast to fetch files from S3 using … My current understanding is that this is caused by re-establishing a connection to S3 for each read rather than using a single connection for all reads. At the moment, if you want to read multiple files using the same connection, it's better to pass a file-like object to … I appreciate the help from the maintainers in this thread 👍
Description
I've looked at the issues, docs and asked this question on SO and I think this is a missing-but-useful feature.
As per my SO question, I am using Polars on AWS Lambda. I care about response times (Polars helps a lot with this, thanks!), but I believe there is currently no way to cache/pool connections to AWS between invocations.
A typical pattern with Boto3 is to define the client outside of the AWS handler function, meaning that a new connection is only created for a cold start. I think that Polars will make a new connection each time `read_parquet` (etc.) is called. When calling S3 this will lead to some authentication overhead for each request, which I think could be avoided if I could pass in a connection object that's pooled / in a global scope.
I've not yet benchmarked this, but it's possible that with small parquet files it might be faster to request the file with a pooled `boto3` connection and read the data from boto into Polars. Ideally I'd be able to manage everything in Polars somehow.
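*(Editor's note: a minimal sketch of the Boto3-on-Lambda pattern described above; the handler name and event shape are hypothetical.)*

```python
import io

import boto3
import polars as pl

# Created once per container, outside the handler: warm invocations
# reuse this client and its pooled HTTPS connection to S3.
S3 = boto3.client("s3")

def handler(event, context):
    # Hypothetical event shape: {"bucket": ..., "key": ...}
    obj = S3.get_object(Bucket=event["bucket"], Key=event["key"])
    df = pl.read_parquet(io.BytesIO(obj["Body"].read()))
    return {"rows": df.height}
```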