-
Is the schema known at compile time or is it dynamic?
-
So the good news is this works! Here's what I've got so far: https://github.com/adriangb/pgpq
-
Interestingly I think I may have found a bug while working on this: #3646
-
I couldn't find anything out there, so I'm looking into writing something to move data from Parquet files (possibly stored in an object store or served over HTTP) into Postgres as fast as possible. For my use case I have a couple of requirements:
My initial attempt used Polars, but unfortunately it is not able to read Parquet files in batches efficiently, and converting Arrow -> Python types (via Polars) and then Python types -> Postgres (via asyncpg) is slow.
So naturally I'm looking at doing this in Rust to speed things up.
Here's my plan; I want to see if it seems viable. I'll have a Rust library that goes from `RecordBatch` to Postgres' binary format. It'll look something like this:
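(A minimal sketch of that idea, assuming the `arrow` and `bytes` crates; it covers only `Int32` and `Utf8` columns, and the function names are illustrative placeholders rather than a committed API.)

```rust
use arrow::array::{Array, Int32Array, StringArray};
use arrow::datatypes::DataType;
use arrow::record_batch::RecordBatch;
use bytes::{BufMut, BytesMut};

/// COPY BINARY header: 11-byte signature, 32-bit flags, 32-bit extension length.
pub fn write_header(buf: &mut BytesMut) {
    buf.put_slice(b"PGCOPY\n\xff\r\n\0");
    buf.put_i32(0); // flags field
    buf.put_i32(0); // header extension area length
}

/// Encode every row of a RecordBatch as COPY BINARY tuples.
pub fn write_batch(batch: &RecordBatch, buf: &mut BytesMut) {
    for row in 0..batch.num_rows() {
        buf.put_i16(batch.num_columns() as i16); // per-tuple field count
        for col in batch.columns() {
            if col.is_null(row) {
                buf.put_i32(-1); // NULL: length -1, no data bytes
                continue;
            }
            match col.data_type() {
                DataType::Int32 => {
                    let a = col.as_any().downcast_ref::<Int32Array>().unwrap();
                    buf.put_i32(4); // INT4 is 4 bytes
                    buf.put_i32(a.value(row));
                }
                DataType::Utf8 => {
                    let a = col.as_any().downcast_ref::<StringArray>().unwrap();
                    let v = a.value(row).as_bytes();
                    buf.put_i32(v.len() as i32);
                    buf.put_slice(v);
                }
                other => unimplemented!("{other:?} is out of scope for this sketch"),
            }
        }
    }
}

/// End-of-data marker: a tuple field count of -1.
pub fn write_footer(buf: &mut BytesMut) {
    buf.put_i16(-1);
}
```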
And then a Python side that manages the IO, so that it can decide to stream the bytes into Postgres directly, write them to a file somewhere, read the Parquet file from disk, run in threads for an async environment, etc. That would look something like this:
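(Again a sketch, assuming psycopg 3 for the COPY streaming; `pgpq_rs` and its three functions are placeholder names mirroring the Rust sketch above, here assumed to return `bytes` at the binding layer.)

```python
import psycopg
import pyarrow.parquet as pq

# `pgpq_rs` is a placeholder for the compiled Rust extension sketched above.
from pgpq_rs import write_header, write_batch, write_footer


def copy_parquet(path: str, dsn: str, table: str) -> None:
    reader = pq.ParquetFile(path)
    with psycopg.connect(dsn) as conn, conn.cursor() as cur:
        # psycopg 3 streams everything written to `copy` straight to the server.
        with cur.copy(f"COPY {table} FROM STDIN WITH (FORMAT BINARY)") as copy:
            copy.write(write_header())
            for batch in reader.iter_batches():  # pyarrow RecordBatches
                copy.write(write_batch(batch))
            copy.write(write_footer())
```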
I've already written something to encode Rust types into Postgres types based on rust-postgres-binary-copy. Where I'm a bit confused is how to iterate over a `RecordBatch` and get Rust types out of it. Alternatively, I guess we could iterate over a `RecordBatch` and get Arrow data out of it, but then I'd need to re-implement the `postgres_types::ToSql` trait for all Arrow types. And I expect the extra hop from Arrow -> Rust native types to be quite cheap.
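For concreteness, a minimal sketch of that extra hop, assuming the `arrow` crate and an example two-column schema; native values like `i32` and `&str` already implement `postgres_types::ToSql`, so they could feed the existing encoder directly:

```rust
use arrow::array::{Array, Int32Array, StringArray};
use arrow::record_batch::RecordBatch;

// Downcast each column once, then walk rows; `is_valid` maps Arrow's validity
// bitmap to Option, so nulls survive the hop to native Rust types.
fn rows(batch: &RecordBatch) -> Vec<(Option<i32>, Option<&str>)> {
    let ids = batch.column(0).as_any().downcast_ref::<Int32Array>().unwrap();
    let names = batch.column(1).as_any().downcast_ref::<StringArray>().unwrap();
    (0..batch.num_rows())
        .map(|i| {
            (
                ids.is_valid(i).then(|| ids.value(i)),
                names.is_valid(i).then(|| names.value(i)),
            )
        })
        .collect()
}
```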