
[WIP] Provide spark catalog, dsv2 and use parquet for copy/unload #120

Closed
wants to merge 175 commits

Conversation


@parisni parisni commented Jan 3, 2023

This PR:

  1. merges [WIP] Issue #69: Support for DSV2 (#70) for datasource v2 on master; fixes "Feature request: upgrade to datasource v2" (#119)
  2. spark catalog feature for reading, writing and DDL from Spark SQL to Redshift (see readme.md and the usage sketch after this list); fixes "Feature request: provide spark catalog" (#118)
  3. a cache with a TTL on s3 for each table (for analytics use cases); fixes "Feature request: cache queries on s3 with TTL" (#114)
  4. fixes empty parquet files when there are no rows in Redshift; fixes "Feature request: unload as parquet file" (#116)
  5. provides COPY from parquet; fixes "Feature request: copy data with parquet format" (#117)
  6. supports Redshift column comments and faster table schema discovery
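
For orientation, a minimal usage sketch of the new read path (the option names unloadformat and table_minutes_ttl are taken from the diff discussed below; the format string, connection details, table and tempdir are placeholders/assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("redshift-dsv2-demo").getOrCreate()

// DSv2 read using the options introduced in this PR.
val df = spark.read
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", "jdbc:redshift://host:5439/db?user=user&password=pass") // placeholder connection
  .option("dbtable", "public.my_table")                                  // placeholder table
  .option("tempdir", "s3a://bucket/tmp/")                                // placeholder unload location
  .option("unloadformat", "parquet")                                     // "csv" is the default
  .option("table_minutes_ttl", "60")                                     // -1 disables the s3 cache
  .load()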

rxin and others added 30 commits November 8, 2017 09:29
…563_remove-itests-from-public

Remove itests. Fix jdbc url. Update Redshift jdbc driver
…488_cleanup-fix-double-to-float

Fix double type to float and cleanup
…486_avoid-log-creds

datalake-486 avoid log creds
…4899_empty-string-to-null

Empty string is converted to null
…un - fix for STS token aws access in progress
…ion between different libraries versions! Tests pass and can compile spark-on-paasta and spark successfullygit add src/ project/
@parisni parisni marked this pull request as ready for review January 3, 2023 15:05
@parisni parisni changed the title Provide spark catalog, dsv2 and use parquet for copy/unload [WIP] Provide spark catalog, dsv2 and use parquet for copy/unload Jan 3, 2023
@@ -0,0 +1 @@
sbt.version=1.7.1
Collaborator

what is this file about? I don't think it's needed, and it also doesn't match the other sbt.version

Comment on lines 1 to 14
addSbtPlugin("com.github.mpeltonen" % "sbt-idea" % "1.6.0")

addSbtPlugin("net.virtual-void" % "sbt-dependency-graph" % "0.7.5")

addSbtPlugin("org.scoverage" % "sbt-scoverage" % "1.5.0")

addSbtPlugin("org.scalastyle" %% "scalastyle-sbt-plugin" % "0.8.0")

addSbtPlugin("me.lessis" % "bintray-sbt" % "0.3.0")

addSbtPlugin("com.github.gseitz" % "sbt-release" % "1.0.0")

addSbtPlugin("com.jsuereth" % "sbt-pgp" % "1.0.0")

Collaborator

why remove these plugins? Some, like dependency-graph, are just nice to have, but I'm pretty sure others like sbt-release and sbt-pgp are needed in order to release the jars to Sonatype.

Author

Will revert that commit about sbt. For obscure reasons I need them to build in my current dev setup.

build.sbt Outdated
Comment on lines 100 to 102
releaseCrossBuild := true,
licenses += ("Apache-2.0", url("http://www.apache.org/licenses/LICENSE-2.0")),
releasePublishArtifactsAction := PgpKeys.publishSigned.value,
Collaborator

I'm not sure about changes in this file either

Author

Same, ignore this, will revert

@@ -570,6 +607,12 @@ and use that as a temp location for this data.
<td>Determined by the JDBC URL's subprotocol</td>
<td>The class name of the JDBC driver to use. This class must be on the classpath. In most cases, it should not be necessary to specify this option, as the appropriate driver classname should automatically be determined by the JDBC URL's subprotocol.</td>
</tr>
Collaborator

let's add a readme entry for the unload format too

Comment on lines +44 to +45
"unloadformat" -> "csv",
"table_minutes_ttl" -> "-1"
Collaborator

If I understand the PR correctly, only the v2 sources implement these parameters. Do you intend to make the v1 sources respect these parameters too? I think it's ok if they don't, but then we should probably make a distinct parameters class for dsv2 to make it clearer to users what is available in each version.

That would also let us make the v2 sources default to parquet unloadformat without breaking any backwards compatibility.
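
A rough sketch of such a distinct dsv2 parameters class (names and shape are illustrative; only the option names and defaults come from this PR):

// Illustrative only: a v2-specific wrapper keeps the v1 documentation accurate.
case class RedshiftV2Parameters(raw: Map[String, String]) {
  // Could default to "parquet" for v2 without touching v1 behaviour.
  def unloadFormat: String = raw.getOrElse("unloadformat", "csv")
  // -1 disables the s3 cache; any positive value is a TTL in minutes.
  def tableMinutesTtl: Int = raw.getOrElse("table_minutes_ttl", "-1").toInt
}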

Author

I could backport the TTL for data source v1. But your proposal makes sense.

Author
@parisni parisni Jan 3, 2023

DSv2 provides limit pushdown in Spark 3.3.x, so dsv1 and a lot of the code about the CSV format could just be removed instead. That's the alternative I'd like to discuss.

import io.github.spark_redshift_community.spark.redshift


class RedshiftCatalog extends JDBCTableCatalog {
Collaborator

can we add any test coverage of the catalog capabilities?

Author

Yeah, the catalog is a good way to test the whole thing
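
One possible smoke test, assuming the catalog is registered through the usual spark.sql.catalog.* properties (the config keys follow JDBCTableCatalog conventions; jdbcUrl and the schema are placeholder fixtures):

test("catalog lists redshift tables") {
  spark.conf.set("spark.sql.catalog.redshift",
    "io.github.spark_redshift_community.spark.redshift.RedshiftCatalog")
  spark.conf.set("spark.sql.catalog.redshift.url", jdbcUrl) // placeholder test fixture
  // Exercises namespace listing, table resolution and schema discovery end to end.
  assert(spark.sql("SHOW TABLES IN redshift.public").collect().nonEmpty)
}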

(ident.namespace() :+ ident.name()).map(dialect.quoteIdentifier).mkString(".")
}
override def invalidateTable(ident: Identifier): Unit = {
// TODO When refresh table, then drop the s3 folder
Collaborator

is this a todo within this PR or for later? One of the readme entries mentioned we could invalidate the cache by doing a table refresh. I'm not sure if that is the same thing as invalidateTable.

Author

You're right, and I am about to implement it
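
A rough sketch of that invalidation, assuming the cached unload lives under a per-table prefix below the configured tempdir (the layout and the helper name are assumptions, not this PR's actual code):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: drop a table's cached unload folder so the next read goes back to Redshift.
def dropCachedUnload(tempDir: String, namespace: Seq[String], table: String, conf: Configuration): Unit = {
  val prefix = new Path(s"$tempDir/${(namespace :+ table).mkString(".")}")
  val fs = FileSystem.get(new URI(prefix.toString), conf)
  if (fs.exists(prefix)) {
    fs.delete(prefix, true) // recursively delete all cache candidates for this table
  }
}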

Comment on lines +45 to +47
val convertedReadSchema = StructType(readDataSchema()
.copy().map(field => field.copy(dataType = StringType)))
val convertedDataSchema = StructType(dataSchema.copy().map(x => x.copy(dataType = StringType)))
Collaborator

nitpicking: only need to do these conversions if we're in csv mode
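
A minimal sketch of that guard (unloadFormat is an assumed accessor on the parameters; readDataSchema() is the method from the snippet above):

import org.apache.spark.sql.types.{StringType, StructType}

val convertedReadSchema =
  if (unloadFormat == "csv") {
    // CSV unloads come back as text, so widen every field to StringType before parsing.
    StructType(readDataSchema().map(_.copy(dataType = StringType)))
  } else {
    readDataSchema() // parquet already carries the real types
  }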

Comment on lines +64 to +68
/**
* A name to identify this table. Implementations should provide a meaningful name, like the
* database and table name from catalog, or the location of files for this table.
*/
override def name(): String = "redshift"
Collaborator

I don't know how this name is used in Spark. Do we need to provide something more descriptive if we want users to be able to load multiple tables? We have the getTableNameOrSubquery param, which might be nice to include in this name depending on how it is used.
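
For example, something along these lines (params.getTableNameOrSubquery is the accessor mentioned above; whether params is in scope here is an assumption):

// Sketch: make the displayed name distinguish the underlying relation.
override def name(): String = s"redshift:${params.getTableNameOrSubquery}"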


val jdbcWrapper: JDBCWrapper = DefaultJDBCWrapper

private def buildUnloadStmt(
Collaborator

related to the other comment about v1 sources supporting ttl and parquet. IIRC there is a build-unload-stmt method somewhere already, and we might be able to share code between the v1 and v2 sources to have a single unload stmt builder?

Author

Yeah refactoring this would be valuable
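
For illustration, a shared builder could look roughly like this (the method shape and escaping are simplified assumptions; the UNLOAD clauses are standard Redshift syntax):

// Sketch of a single unload-statement builder that both the v1 and v2 paths could call.
def buildUnloadStmt(query: String, s3Path: String, iamRole: String, format: String): String = {
  val escaped = query.replace("'", "''") // UNLOAD embeds the query as a quoted string literal
  val formatClause =
    if (format.equalsIgnoreCase("parquet")) "FORMAT AS PARQUET"
    else "ESCAPE NULL AS '@NULL@'" // text/CSV-style unload options
  s"UNLOAD ('$escaped') TO '$s3Path' IAM_ROLE '$iamRole' $formatClause"
}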

Comment on lines 50 to 51
import org.apache.hadoop.fs.{FileSystem, Path}
import java.net.URI
Collaborator

curious why we import here instead of at the top of the module like usual.

Author

Well, in case I'd like to move the logic somewhere else this will help a bit. Nothing really important

This avoids s3 file listings to find the last cache candidate
Also uses hadoop FS instead of the aws low-level client
@@ -179,15 +181,44 @@ private[redshift] class JDBCWrapper {
val isSigned = rsmd.isSigned(i + 1)
val nullable = rsmd.isNullable(i + 1) != ResultSetMetaData.columnNoNulls
val columnType = getCatalystType(dataType, fieldSize, fieldScale, isSigned)
val comment = comments.get(columnName)
if(!comment.isEmpty){
fields(i) = StructField(columnName, columnType, nullable, comment.get)
Collaborator

Nit: Indentation.

@@ -165,7 +166,8 @@ private[redshift] class JDBCWrapper {
// the underlying JDBC driver implementation implements PreparedStatement.getMetaData() by
// executing the query. It looks like the standard Redshift and Postgres JDBC drivers don't do
// this but we leave the LIMIT condition here as a safety-net to guard against perf regressions.
val ps = conn.prepareStatement(s"SELECT * FROM $table LIMIT 1")
val comments = resolveComments(conn, table)
val ps = conn.prepareStatement(s"SELECT * FROM $table LIMIT 0")
Collaborator
@88manpreet 88manpreet Jun 15, 2023

Is the LIMIT changed to 0 since we only care about column metadata, making it slightly more performant?

Author

Yeah definitely
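
For context, a minimal sketch of the metadata-only pattern being discussed (plain JDBC; the connection comes from elsewhere):

import java.sql.Connection

// With LIMIT 0 the statement returns no rows, but its ResultSetMetaData still
// describes every column, which is all the schema discovery needs.
def columnCount(conn: Connection, table: String): Int = {
  val ps = conn.prepareStatement(s"SELECT * FROM $table LIMIT 0")
  try ps.executeQuery().getMetaData.getColumnCount
  finally ps.close()
}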

@@ -179,15 +181,44 @@ private[redshift] class JDBCWrapper {
val isSigned = rsmd.isSigned(i + 1)
val nullable = rsmd.isNullable(i + 1) != ResultSetMetaData.columnNoNulls
val columnType = getCatalystType(dataType, fieldSize, fieldScale, isSigned)
val comment = comments.get(columnName)
Collaborator

I believe this is added to preserve the comments?

Author

indeed

@parisni
Author

parisni commented Jun 15, 2023

Found two issues with this:

  • Spark's parallelism for reading the parquet files equals the number of files, which makes read performance after the unload poor. It would be better to just read the unload folder and skip the manifest (see the sketch after this list).
  • When the query is cancelled on the Redshift side, no error occurs and the lib returns a dataframe with the current, incomplete state of the content.
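
A minimal sketch of the first point (the unload location is a placeholder):

// Reading the unload folder directly lets Spark derive splits from file sizes,
// instead of one partition per file listed in the manifest.
val unloaded = spark.read.parquet("s3a://bucket/tempdir/table-prefix/") // placeholder unload location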

checkAnswer(
sqlContext.sql("select * from test_table"),
TestUtils.expectedData)
withUnloadFormat {
Collaborator

If I understand correctly, RedshiftReadSuite only covers the 'csv' format?
If so, are there plans to extend it to the 'parquet' format too?
If not, I believe the changes in this particular file are a no-op.
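
For reference, one way such a helper could cover both formats (a sketch only; the actual withUnloadFormat in this PR may be wired differently):

protected var unloadFormat: String = "csv"

// Run the enclosed assertions once per unload format; the suite's read helper is assumed
// to pass `unloadFormat` along as the "unloadformat" option.
def withUnloadFormat(body: => Unit): Unit = {
  Seq("csv", "parquet").foreach { format =>
    unloadFormat = format
    body
  }
}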

/**
* Create a new DataFrameReader using common options for reading from Redshift.
*/
override protected def read: DataFrameReader = {
Collaborator

Is this test file complete? Are there plans to add more tests similar to RedshiftReadSuite.scala?

/**
* The AWS SSE-KMS key to use for encryption during UNLOAD operations
* instead of AWS's default encryption
*/
def sseKmsKey: Option[String] = parameters.get("sse_kms_key")

/**
* The Int value to write for nulls when using CSV.
Collaborator

Slightly confused by this comment.
Did you intend to write "The Int value to write for ttl when using CSV."?

@smoy
Member

smoy commented Sep 1, 2023

Because sensitive material was recently introduced, I had to rewrite history using the procedure here: https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/removing-sensitive-data-from-a-repository

This creates a lot more conflicts in this pull request. If this PR is still wanted, please open a new one instead.

In addition, the AWS contribution has brought along many improvements that include some of the intended features of this original PR. Check #128

@smoy smoy closed this Sep 1, 2023