When loading data into Redshift using a CTAS statement, the Redshift connector defaults to the BaseJdbcClient behaviour of using batched INSERT statements. While this works well for most database systems, Redshift's handling of INSERT statements is very slow and, according to this article by AWS, is considered an anti-pattern; the same is true for most OLAP systems. In real-world use, we see about 300 rows per second per Redshift node.
I was wondering if there was an appetite to improve this using the Redshift COPY statement? I think it would work as follows:
Administrator enables inserts using COPY as part of the Redshift catalog properties (new property)
Administrator provides an AWS Access Key ID and Secret Access Key (or optionally an IAM role if running on AWS), plus a bucket and prefix (new properties)
When running a CTAS statement, Trino still creates the table as it did before, but instead of streaming data using INSERT statements, it writes the query results (as Parquet or CSV) to S3 at s3://my-bucket/my-prefix/trino_query_id/parts
When all parts are written, Trino issues a COPY ... FROM ... statement to load them into the target table
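To make the last two steps concrete, here is a rough sketch of what issuing the COPY over the connector's existing JDBC connection could look like once all part files are written. This is not existing connector code; the bucket, prefix, IAM role, and class/method names are all hypothetical placeholders:

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper: once every worker has flushed its part files under the
// query-specific S3 prefix, load them into the target table with a single COPY.
final class RedshiftCopyLoader
{
    private RedshiftCopyLoader() {}

    static void copyFromS3(Connection connection, String targetTable, String queryId)
            throws SQLException
    {
        // Redshift loads every object under the prefix, so all part files written
        // by the workers for this query are picked up in one pass.
        String location = "s3://my-bucket/my-prefix/" + queryId + "/parts/";

        String copySql = "COPY " + targetTable +
                " FROM '" + location + "'" +
                " IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'" +
                " FORMAT AS PARQUET";

        try (Statement statement = connection.createStatement()) {
            statement.execute(copySql);
        }
    }
}
```

If key-based credentials are configured instead of a role, the IAM_ROLE clause would be swapped for Redshift's ACCESS_KEY_ID / SECRET_ACCESS_KEY parameters, mapping directly to the new catalog properties above.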
I can put my hand up to implement this if it's felt this would be useful to others. Based on similar implementations, the way to do it seems to be implementing a PageSink and then a PageSinkProvider for this operation, as well as any surrounding credential providers. Please let me know if I'm overlooking anything in thinking about it this way, or not considering anything major.
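To sketch the shape I'm imagining (very roughly, and assuming the standard ConnectorPageSink SPI, whose exact signatures vary by Trino version): each worker's sink writes Parquet part files under the query-specific S3 prefix, and the COPY runs after all sinks finish. The S3PartFileWriter helper and all other names here are hypothetical, not existing connector code:

```java
import io.airlift.slice.Slice;
import io.trino.spi.Page;
import io.trino.spi.connector.ConnectorPageSink;

import java.util.Collection;
import java.util.List;
import java.util.concurrent.CompletableFuture;

import static java.util.concurrent.CompletableFuture.completedFuture;

public class RedshiftCopyPageSink
        implements ConnectorPageSink
{
    // Hypothetical helper wrapping the S3 client and a Parquet (or CSV) writer.
    public interface S3PartFileWriter
    {
        void append(Page page);

        void close();

        void cleanUp();
    }

    private final S3PartFileWriter partWriter;

    public RedshiftCopyPageSink(S3PartFileWriter partWriter)
    {
        this.partWriter = partWriter;
    }

    @Override
    public CompletableFuture<?> appendPage(Page page)
    {
        // Buffer pages and roll a new part file under s3://bucket/prefix/<query id>/parts/
        // whenever the buffer crosses a size threshold.
        partWriter.append(page);
        return NOT_BLOCKED;
    }

    @Override
    public CompletableFuture<Collection<Slice>> finish()
    {
        // Flush the final part file; the COPY ... FROM ... itself would run once all
        // sinks have finished (e.g. from the metadata finishCreateTable/finishInsert path).
        partWriter.close();
        return completedFuture(List.of());
    }

    @Override
    public void abort()
    {
        // Best-effort cleanup of any part files already written for this query.
        partWriter.cleanUp();
    }
}
```

The matching PageSinkProvider would mostly just construct this sink with the S3 credentials, bucket, and prefix taken from the new catalog properties.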
I have noticed that @mayankvadariya is implementing an adjacent feature in #24117, though, so I don't want to step on any toes or duplicate effort if this is already being worked on.
What's the use case for moving data into Redshift using Trino?
JDBC connectors are mainly intended to allow ad-hoc queries and easy extraction of data into the lake (Hive/Delta/Iceberg).
Feel free to implement it, but it's not a use case that we're typically optimizing for.
Thanks @mayankvadariya, this is great functionality. Thank you for implementing!
@raunaqmorarka For multi-cloud ETL jobs moving data out of a relational database into Redshift to support other operational use cases. I definitely understand you typically see this going the other direction, but easy movement of data in either direction opens up all sorts of interesting use cases.
I'll see where I can devote some cycles to look into this.