Python/S3 connection pooling #14572
I've managed to run some benchmarks now. Here's what I did: I have a load of parquet files on S3 that range from ~2kb to ~2mb depending on the size of the shard. I made a list of 100 random files (the same each time, so there'll be some optimisations on S3 between runs, I imagine). These files are listed in … This was all run from my computer in the same region as the S3 bucket. I imagine all speeds would be faster on AWS (Lambda or EC2), but I'm also assuming that the relative speeds would be the same.
With method 2, I'd assumed Polars might download less data, but it looks like it's doing more or less the same as downloading all the data. Maybe I've misunderstood how this works, or I've made some other mistake. Either way, I think this shows the performance improvements that could be possible for my use case with a connection pool.
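*(Editor's note: the benchmark scripts that "method 1"/"method 2" refer to aren't shown in the thread. The sketch below is a reconstruction, assuming method 1 is a single pooled boto3 client feeding Polars an in-memory buffer and method 2 is Polars reading `s3://` paths directly — consistent with the workaround described later in the thread. Bucket and key names are illustrative.)*

```python
import io
import time

import boto3
import polars as pl

BUCKET = "my-bucket"                                      # hypothetical
KEYS = [f"prefix/shard_{i}.parquet" for i in range(100)]  # illustrative names

# Method 1: one global boto3 client (pooled HTTPS connection);
# Polars reads each file from an in-memory buffer.
s3 = boto3.client("s3")

start = time.perf_counter()
for key in KEYS:
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    df = pl.read_parquet(io.BytesIO(body))
print(f"boto3 + BytesIO: {time.perf_counter() - start:.2f}s")

# Method 2: Polars fetches directly from S3, setting up its own
# connection(s) on each call.
start = time.perf_counter()
for key in KEYS:
    df = pl.read_parquet(f"s3://{BUCKET}/{key}")
print(f"pl.read_parquet(s3://...): {time.perf_counter() - start:.2f}s")
```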
It might be related to this issue #14384 and this comment: #14384 (comment)
Thanks for the speedy response! I've built the version at 0c86901 and run my tests again. This time I used an EC2 instance (to speed up build times on my slow laptop). The good news is that things are faster. Before the change … However, Boto3 with a global client is still outperforming Polars by a long way. The baseline for reading from boto and then passing this file on to Polars is about 2.5 seconds. I don't know if this is because boto3 is doing something that … I think this issue might need re-opening if the latter is true and this is something in the control of Polars. If it's not, then I expect the existing change is about all this project can do. I'll leave that up to someone who knows more about it than I do! Thanks again for the quick fix!
Could you try getting Polars out of the equation and run the benchmark directly with object_store? Curious what the baseline differences are.
I don't know Rust very well, but I took some time hacking something together. A release version of a simple script that loops over the exact same files as my Python script actually completes slightly faster than Boto3.
As I say, the new Polars version is much better than before the change, but there's something going on that's not related to object_store, I believe.

Script outline:

```rust
use std::time::Instant;

use object_store::{aws::AmazonS3Builder, path::Path, ObjectStore};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let s3_store = AmazonS3Builder::from_env()
        .with_bucket_name(bucket_name)
        .build()?;

    let files_to_download = vec![...]; // the same 100 files as the Python script

    let start = Instant::now(); // Start timing
    for file_path in files_to_download {
        let object_key = format!("prefix/{}", file_path);
        let object_path = Path::from(object_key);
        // One GET per file; object_store reuses its internal HTTP client.
        let data = s3_store.get(&object_path).await?.bytes().await?;
        println!("Downloaded {} bytes for {}", data.len(), file_path);
    }
    let duration = start.elapsed(); // End timing
    println!("Total: {} milliseconds", duration.as_millis());
    Ok(())
}
```

Of course in my Python scripts I'm also reading the parquet into Polars, but I doubt that's adding a lot of overhead.
Alright, interesting. Have you got some info on the Polars query you run? Polars downloads all row groups separately (concurrently), and in the case of projections it downloads the columns concurrently. How many row groups have you got per file? If you create files with a single row group, does that improve? (I am on vacation so have limited internet and cannot do cloud tests.)
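*(Editor's note: a minimal sketch of producing a single-row-group file for that test in py-polars; the file names are hypothetical.)*

```python
import polars as pl

# Hypothetical local copy of one shard; the real files live on S3.
df = pl.read_parquet("shard.parquet")

# A row_group_size larger than the row count (the thread mentions
# ~80k rows max) should yield one row group in the output file.
df.write_parquet("shard_single_rg.parquet", row_group_size=100_000)
```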
For this test I'm literally just calling … I can have a go at making smaller files. The largest file contains about 80k rows, most are about 25k rows. Sorry, I don't know enough about the workings of Parquet... is a row group the same as a row?! There are 8 columns in each file, with a fair amount of duplication of column values across rows.
You can check this function to read the number of row groups. It is a sharding mechanism for the Parquet format, designed so that Parquet readers can batch their data ingestion by this unit.
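*(Editor's note: the function linked above isn't visible in this scrape; from Python, pyarrow offers an equivalent check — a sketch with a hypothetical local path.)*

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("shard.parquet")  # hypothetical local path
print(pf.metadata.num_row_groups)     # e.g. 1
```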
Ok, thanks. I'm at the limits of my current understanding of Parquet, but I think a row group might relate to the number of actual files data is split across, e.g. when using hive partitions? Either way, I sampled a load of my files, including the ones in the tests, and they are all 1 row group. I've manually partitioned my data according to the domain I'm working in, so I know that when I'm using … I did try using hive partitions (before the patch related to this issue was in), but the processing time was very slow, I assume partly because of the (lack of a) connection pool. I've not tried with the patch, but as a single file downloads quickly enough using boto3 for my needs, I don't think changing the sharding approach is related to the issue here.
I appreciate this issue is closed, but for anyone else with this problem who might find this issue in future: as things currently stand it's about twice as fast to fetch files from S3 using … My current understanding is that this is caused by re-establishing a connection to S3 for each read rather than using a single connection for all reads. At the moment, if you want to read multiple files using the same connection, it's better to pass a file-like object to … I appreciate the help from the maintainers in this thread 👍
Description
I've looked at the issues, docs and asked this question on SO and I think this is a missing-but-useful feature.
As per my SO question, I am using Polars on AWS Lambda. I care about response times (Polars helps a lot with this, thanks!), but I believe there is currently no way to cache/pool connections to AWS between invocations.
A typical pattern with Boto3 is to define the client outside of the AWS handler function, meaning that a new connection is only created for a cold start. I think that Polars will make a new connection each time `read_parquet` (etc.) is called. When calling S3 this will lead to some authentication overhead for each request, which I think could be avoided if I could pass in a connection object that's pooled / in a global scope.
I've not yet benchmarked this, but it's possible that with small parquet files it might be faster to request the file with a pooled `boto3` connection and read the data from boto into Polars. Ideally I'd be able to manage everything in Polars somehow.
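*(Editor's note: a minimal sketch of the Boto3-on-Lambda pattern described above; the handler name and event shape are hypothetical.)*

```python
import io

import boto3
import polars as pl

# Created once per container, outside the handler: warm invocations
# reuse this client and its pooled HTTPS connection to S3.
S3 = boto3.client("s3")

def handler(event, context):
    # Hypothetical event shape: {"bucket": ..., "key": ...}
    obj = S3.get_object(Bucket=event["bucket"], Key=event["key"])
    df = pl.read_parquet(io.BytesIO(obj["Body"].read()))
    return {"rows": df.height}
```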