Support Snowflake (#5500) #5502

Merged (9 commits, Jan 13, 2025)
4 changes: 2 additions & 2 deletions .github/workflows/ci.yml
@@ -120,11 +120,11 @@ jobs:

- name: Make target directories
if: github.event_name != 'pull_request' && (startsWith(github.ref, 'refs/tags/v') || github.ref == 'refs/heads/main')
-        run: mkdir -p scio-bom/target scio-tensorflow/target site/target scio-cassandra/cassandra3/target scio-elasticsearch/es8/target scio-jdbc/target scio-macros/target scio-grpc/target scio-elasticsearch/common/target scio-test/target scio-avro/target scio-elasticsearch/es7/target scio-redis/target scio-extra/target scio-test/parquet/target scio-test/core/target scio-google-cloud-platform/target scio-smb/target scio-test/google-cloud-platform/target scio-neo4j/target scio-parquet/target scio-core/target scio-repl/target project/target
+        run: mkdir -p scio-bom/target scio-tensorflow/target site/target scio-cassandra/cassandra3/target scio-elasticsearch/es8/target scio-jdbc/target scio-macros/target scio-grpc/target scio-elasticsearch/common/target scio-test/target scio-avro/target scio-elasticsearch/es7/target scio-snowflake/target scio-redis/target scio-extra/target scio-test/parquet/target scio-test/core/target scio-google-cloud-platform/target scio-smb/target scio-test/google-cloud-platform/target scio-neo4j/target scio-parquet/target scio-core/target scio-repl/target project/target

- name: Compress target directories
if: github.event_name != 'pull_request' && (startsWith(github.ref, 'refs/tags/v') || github.ref == 'refs/heads/main')
-        run: tar cf targets.tar scio-bom/target scio-tensorflow/target site/target scio-cassandra/cassandra3/target scio-elasticsearch/es8/target scio-jdbc/target scio-macros/target scio-grpc/target scio-elasticsearch/common/target scio-test/target scio-avro/target scio-elasticsearch/es7/target scio-redis/target scio-extra/target scio-test/parquet/target scio-test/core/target scio-google-cloud-platform/target scio-smb/target scio-test/google-cloud-platform/target scio-neo4j/target scio-parquet/target scio-core/target scio-repl/target project/target
+        run: tar cf targets.tar scio-bom/target scio-tensorflow/target site/target scio-cassandra/cassandra3/target scio-elasticsearch/es8/target scio-jdbc/target scio-macros/target scio-grpc/target scio-elasticsearch/common/target scio-test/target scio-avro/target scio-elasticsearch/es7/target scio-snowflake/target scio-redis/target scio-extra/target scio-test/parquet/target scio-test/core/target scio-google-cloud-platform/target scio-smb/target scio-test/google-cloud-platform/target scio-neo4j/target scio-parquet/target scio-core/target scio-repl/target project/target

- name: Upload target directories
if: github.event_name != 'pull_request' && (startsWith(github.ref, 'refs/tags/v') || github.ref == 'refs/heads/main')
1 change: 1 addition & 0 deletions README.md
@@ -84,6 +84,7 @@ Scio includes the following artifacts:
- `scio-redis`: add-on for Redis
- `scio-repl`: extension of the Scala REPL with Scio specific operations
- `scio-smb`: add-on for Sort Merge Bucket operations
- `scio-snowflake`: add-on for Snowflake IO
- `scio-tensorflow`: add-on for TensorFlow TFRecords IO and prediction
- `scio-test`: all following test utilities. Add to your project as a "test" dependency
- `scio-test-core`: test core utilities
20 changes: 20 additions & 0 deletions build.sbt
@@ -671,6 +671,7 @@ lazy val scio = project
`scio-redis`,
`scio-repl`,
`scio-smb`,
`scio-snowflake`,
`scio-tensorflow`,
`scio-test-core`,
`scio-test-google-cloud-platform`,
@@ -1265,6 +1266,25 @@ lazy val `scio-parquet` = project
)
)

lazy val `scio-snowflake` = project
.in(file("scio-snowflake"))
.dependsOn(
`scio-core` % "compile;test->test"
)
.settings(commonSettings)
.settings(
description := "Scio add-on for Snowflake",
libraryDependencies ++= Seq(
// compile
"com.nrinaudo" %% "kantan.codecs" % kantanCodecsVersion,
"com.nrinaudo" %% "kantan.csv" % kantanCsvVersion,
"joda-time" % "joda-time" % jodaTimeVersion,
"org.apache.beam" % "beam-sdks-java-core" % beamVersion,
"org.apache.beam" % "beam-sdks-java-io-snowflake" % beamVersion
),
tlMimaPreviousVersions := Set.empty // TODO: remove once released
)
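// Downstream builds can pull in the new module with a single dependency
// (a sketch; `scioVersion` is a placeholder for the first release shipping it):
//
//   libraryDependencies += "com.spotify" %% "scio-snowflake" % scioVersion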

val tensorFlowMetadataSourcesDir =
settingKey[File]("Directory containing TensorFlow metadata proto files")
val tensorFlowMetadata = taskKey[Seq[File]]("Retrieve TensorFlow metadata proto files")
222 changes: 222 additions & 0 deletions scio-snowflake/src/main/scala/com/spotify/scio/snowflake/SnowflakeIO.scala
@@ -0,0 +1,222 @@
/*
* Copyright 2024 Spotify AB.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/

package com.spotify.scio.snowflake

import scala.util.chaining._
import com.spotify.scio.ScioContext
import com.spotify.scio.coders.{Coder, CoderMaterializer}
import com.spotify.scio.io.{EmptyTap, EmptyTapOf, ScioIO, Tap, TapT, TestIO}
import com.spotify.scio.util.ScioUtil
import com.spotify.scio.values.SCollection
import kantan.csv.{RowCodec, RowDecoder, RowEncoder}
import org.apache.beam.sdk.io.snowflake.SnowflakeIO.{CsvMapper, UserDataMapper}
import org.apache.beam.sdk.io.snowflake.data.SnowflakeTableSchema
import org.apache.beam.sdk.io.snowflake.enums.{CreateDisposition, WriteDisposition}
import org.apache.beam.sdk.io.{snowflake => beam}
import org.joda.time.Duration

object SnowflakeIO {

final def apply[T](opts: SnowflakeConnectionOptions, query: String): SnowflakeIO[T] =
new SnowflakeIO[T] with TestIO[T] {
final override val tapT = EmptyTapOf[T]
override def testId: String = s"SnowflakeIO(${snowflakeIoId(opts, query)})"
}

private[snowflake] def snowflakeIoId(opts: SnowflakeConnectionOptions, target: String): String = {
// source params
val params = Option(opts.database).map(db => s"db=$db") ++
  Option(opts.warehouse).map(wh => s"warehouse=$wh")
Review comment (Contributor):

Comparing this to other complex testId implementations in our Scio IOs (example 1, 2), I think a format like this would fit better:

SnowflakeIO(url, target, warehouse?, db?)

wdyt @RustedBones

Reply from @RustedBones (Contributor), Dec 20, 2024:

This is the same as JdbcIO, where we 'normalize' the connection URL, hiding credentials. See https://docs.snowflake.com/en/developer-guide/jdbc/jdbc-parameters
s"${opts.url}${params.mkString("?", "&", "")}:$target"
}
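// Illustrative only: with url "jdbc:snowflake://xy12345.snowflakecomputing.com",
// database "SALES", warehouse "COMPUTE_WH" and query "SELECT * FROM ORDERS",
// the resulting test id is
//
//   SnowflakeIO(jdbc:snowflake://xy12345.snowflakecomputing.com?db=SALES&warehouse=COMPUTE_WH:SELECT * FROM ORDERS)
//
// matching the credential-hiding normalization discussed in the review thread above.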

object ReadParam {
type ConfigOverride[T] = beam.SnowflakeIO.Read[T] => beam.SnowflakeIO.Read[T]

val DefaultStagingBucketName: String = null
val DefaultQuotationMark: String = null
val DefaultConfigOverride = null
}
final case class ReadParam[T](
storageIntegrationName: String,
stagingBucketName: String = ReadParam.DefaultStagingBucketName,
quotationMark: String = ReadParam.DefaultQuotationMark,
configOverride: ReadParam.ConfigOverride[T] = ReadParam.DefaultConfigOverride
)

object WriteParam {
type ConfigOverride[T] = beam.SnowflakeIO.Write[T] => beam.SnowflakeIO.Write[T]

val DefaultTableSchema: SnowflakeTableSchema = null
val DefaultCreateDisposition: CreateDisposition = null
val DefaultWriteDisposition: WriteDisposition = null
val DefaultSnowPipe: String = null
val DefaultShardNumber: Integer = null
val DefaultFlushRowLimit: Integer = null
val DefaultFlushTimeLimit: Duration = null
val DefaultStorageIntegrationName: String = null
val DefaultStagingBucketName: String = null
val DefaultQuotationMark: String = null
val DefaultConfigOverride = null
}
final case class WriteParam[T](
tableSchema: SnowflakeTableSchema = WriteParam.DefaultTableSchema,
createDisposition: CreateDisposition = WriteParam.DefaultCreateDisposition,
writeDisposition: WriteDisposition = WriteParam.DefaultWriteDisposition,
snowPipe: String = WriteParam.DefaultSnowPipe,
shardNumber: Integer = WriteParam.DefaultShardNumber,
flushRowLimit: Integer = WriteParam.DefaultFlushRowLimit,
flushTimeLimit: Duration = WriteParam.DefaultFlushTimeLimit,
storageIntegrationName: String = WriteParam.DefaultStorageIntegrationName,
stagingBucketName: String = WriteParam.DefaultStagingBucketName,
quotationMark: String = WriteParam.DefaultQuotationMark,
configOverride: WriteParam.ConfigOverride[T] = WriteParam.DefaultConfigOverride
)

private[snowflake] def dataSourceConfiguration(connectionOptions: SnowflakeConnectionOptions) =
beam.SnowflakeIO.DataSourceConfiguration
.create()
.withUrl(connectionOptions.url)
.pipe { ds =>
import SnowflakeAuthenticationOptions._
Option(connectionOptions.authenticationOptions).fold(ds) {
case UsernamePassword(username, password) =>
ds.withUsernamePasswordAuth(username, password)
case KeyPair(username, privateKeyPath, None) =>
ds.withKeyPairPathAuth(username, privateKeyPath)
case KeyPair(username, privateKeyPath, Some(passphrase)) =>
ds.withKeyPairPathAuth(username, privateKeyPath, passphrase)
case OAuthToken(token) =>
ds.withOAuth(token).withAuthenticator("oauth")
}
}
.pipe(ds => Option(connectionOptions.database).fold(ds)(ds.withDatabase))
.pipe(ds => Option(connectionOptions.role).fold(ds)(ds.withRole))
.pipe(ds => Option(connectionOptions.warehouse).fold(ds)(ds.withWarehouse))
.pipe(ds =>
Option(connectionOptions.loginTimeout)
.map[Integer](_.getStandardSeconds.toInt)
.fold(ds)(ds.withLoginTimeout)
)
.pipe(ds => Option(connectionOptions.schema).fold(ds)(ds.withSchema))

private[snowflake] def csvMapper[T: RowDecoder]: CsvMapper[T] = { (parts: Array[String]) =>
val unsnowedParts = parts.map {
case "\\N" => "" // needs to be mapped to an Option
case other => other
}.toSeq
RowDecoder[T].unsafeDecode(unsnowedParts)
}

private[snowflake] def userDataMapper[T: RowEncoder]: UserDataMapper[T] = { (element: T) =>
RowEncoder[T].encode(element).toArray
}
}
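// A sketch of the kantan.csv plumbing this object expects in user code
// (`Order` and its fields are hypothetical, not part of this PR):
//
//   import kantan.csv.RowCodec
//
//   final case class Order(id: Long, item: String, amount: Option[Double])
//   implicit val orderCodec: RowCodec[Order] =
//     RowCodec.caseCodec(0, 1, 2)(Order.apply)(Order.unapply)
//
// With the codec in scope, csvMapper decodes each staged CSV line into an
// Order (Snowflake's \N null marker becomes None for `amount`), and
// userDataMapper encodes an Order back into the Array[String] handed to Beam.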

sealed trait SnowflakeIO[T] extends ScioIO[T]

final case class SnowflakeSelect[T](connectionOptions: SnowflakeConnectionOptions, query: String)(
implicit
rowDecoder: RowDecoder[T],
coder: Coder[T]
) extends SnowflakeIO[T] {

import SnowflakeIO._

override type ReadP = ReadParam[T]
override type WriteP = Unit
override val tapT: TapT.Aux[T, Nothing] = EmptyTapOf[T]

override def testId: String = s"SnowflakeIO(${snowflakeIoId(connectionOptions, query)})"

override protected def read(sc: ScioContext, params: ReadP): SCollection[T] = {
val tempDirectory = ScioUtil.tempDirOrDefault(params.stagingBucketName, sc).toString
val t = beam.SnowflakeIO
.read[T]()
.fromQuery(query)
.withDataSourceConfiguration(dataSourceConfiguration(connectionOptions))
.withStorageIntegrationName(params.storageIntegrationName)
.withStagingBucketName(tempDirectory)
.pipe(r => Option(params.quotationMark).fold(r)(r.withQuotationMark))
.withCsvMapper(csvMapper)
.withCoder(CoderMaterializer.beam(sc, coder))
.pipe(r => Option(params.configOverride).fold(r)(_(r)))

sc.applyTransform(t)
}

override protected def write(data: SCollection[T], params: WriteP): Tap[Nothing] =
throw new UnsupportedOperationException("SnowflakeSelect is read-only")

override def tap(params: ReadP): Tap[Nothing] = EmptyTap
}
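// Usage sketch (values are placeholders; `sc.read(io)(params)` is the generic
// ScioContext entry point, and `Order` is the hypothetical type from above):
//
//   val opts = SnowflakeConnectionOptions(
//     url = "jdbc:snowflake://xy12345.snowflakecomputing.com",
//     database = "SALES"
//   )
//   val orders: SCollection[Order] =
//     sc.read(SnowflakeSelect[Order](opts, "SELECT id, item, amount FROM ORDERS"))(
//       SnowflakeIO.ReadParam(storageIntegrationName = "MY_INTEGRATION")
//     )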

final case class SnowflakeTable[T](connectionOptions: SnowflakeConnectionOptions, table: String)(
implicit
rowCodec: RowCodec[T], // use codec for tap
coder: Coder[T]
) extends SnowflakeIO[T] {

import SnowflakeIO._

override type ReadP = ReadParam[T]
override type WriteP = WriteParam[T]
override val tapT: TapT.Aux[T, Nothing] = EmptyTapOf[T] // TODO Create a tap

override def testId: String = s"SnowflakeIO(${snowflakeIoId(connectionOptions, table)})"

override protected def read(sc: ScioContext, params: ReadP): SCollection[T] = {
val tempDirectory = ScioUtil.tempDirOrDefault(params.stagingBucketName, sc).toString
val t = beam.SnowflakeIO
.read[T]()
.fromTable(table)
.withDataSourceConfiguration(dataSourceConfiguration(connectionOptions))
.withStorageIntegrationName(params.storageIntegrationName)
.withStagingBucketName(tempDirectory)
.pipe(r => Option(params.quotationMark).fold(r)(r.withQuotationMark))
.withCsvMapper(csvMapper)
.withCoder(CoderMaterializer.beam(sc, coder))
.pipe(r => Option(params.configOverride).fold(r)(_(r)))

sc.applyTransform(t)
}

override protected def write(data: SCollection[T], params: WriteP): Tap[Nothing] = {
val tempDirectory = ScioUtil.tempDirOrDefault(params.stagingBucketName, data.context).toString
val t = beam.SnowflakeIO
.write[T]()
.withDataSourceConfiguration(dataSourceConfiguration(connectionOptions))
.to(table)
.pipe(w => Option(params.createDisposition).fold(w)(w.withCreateDisposition))
.pipe(w => Option(params.writeDisposition).fold(w)(w.withWriteDisposition))
.pipe(w => Option(params.snowPipe).fold(w)(w.withSnowPipe))
.pipe(w => Option(params.shardNumber).fold(w)(w.withShardsNumber))
.pipe(w => Option(params.flushRowLimit).fold(w)(w.withFlushRowLimit))
.pipe(w => Option(params.flushTimeLimit).fold(w)(w.withFlushTimeLimit))
.pipe(w => Option(params.quotationMark).fold(w)(w.withQuotationMark))
.pipe(w => Option(params.storageIntegrationName).fold(w)(w.withStorageIntegrationName))
.withStagingBucketName(tempDirectory)
.withUserDataMapper(userDataMapper)
.pipe(w => Option(params.configOverride).fold(w)(_(w)))

data.applyInternal(t)
EmptyTap
}

override def tap(params: ReadP): Tap[Nothing] = EmptyTap
}
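// Usage sketch (a hypothetical write of the `orders` collection from above;
// the integration name and disposition are placeholders):
//
//   orders.write(SnowflakeTable[Order](opts, "ORDERS_COPY"))(
//     SnowflakeIO.WriteParam(
//       createDisposition = CreateDisposition.CREATE_IF_NEEDED,
//       storageIntegrationName = "MY_INTEGRATION"
//     )
//   )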
92 changes: 92 additions & 0 deletions scio-snowflake/src/main/scala/com/spotify/scio/snowflake/SnowflakeConnectionOptions.scala
@@ -0,0 +1,92 @@
/*
* Copyright 2024 Spotify AB.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/

package com.spotify.scio.snowflake

import org.joda.time.Duration

sealed trait SnowflakeAuthenticationOptions

object SnowflakeAuthenticationOptions {

/**
* Snowflake username/password authentication.
*
* @param username
* username
* @param password
* password
*/
final case class UsernamePassword(
username: String,
password: String
) extends SnowflakeAuthenticationOptions

/**
* Key pair authentication.
*
* @param username
* username
* @param privateKeyPath
* path to the private key
* @param privateKeyPassphrase
* passphrase for the private key (optional)
*/
final case class KeyPair(
username: String,
privateKeyPath: String,
privateKeyPassphrase: Option[String] = None
) extends SnowflakeAuthenticationOptions

/**
* OAuth token authentication.
*
* @param token
* OAuth token
*/
final case class OAuthToken(token: String) extends SnowflakeAuthenticationOptions

}

/**
* Options for a Snowflake connection.
*
* @param authenticationOptions
* authentication options
* @param url
* URL of the Snowflake server, in the following format:
* "jdbc:snowflake://[host]:[port].snowflakecomputing.com"
* @param database
* database to use
* @param role
* user's role to be used when running queries on Snowflake
* @param warehouse
* warehouse name
* @param schema
* schema to use when connecting to Snowflake
* @param loginTimeout
* login timeout that will be used by [[net.snowflake.client.jdbc.SnowflakeBasicDataSource]].
*/
final case class SnowflakeConnectionOptions(
url: String,
authenticationOptions: SnowflakeAuthenticationOptions = null,
database: String = null,
role: String = null,
warehouse: String = null,
schema: String = null,
loginTimeout: Duration = null
)
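// A sketch of a fully specified connection (all values are placeholders):
//
//   val connection = SnowflakeConnectionOptions(
//     url = "jdbc:snowflake://xy12345.snowflakecomputing.com",
//     authenticationOptions = SnowflakeAuthenticationOptions.KeyPair(
//       username = "etl_user",
//       privateKeyPath = "/secrets/rsa_key.p8"
//     ),
//     database = "SALES",
//     role = "ETL_ROLE",
//     warehouse = "COMPUTE_WH",
//     schema = "PUBLIC",
//     loginTimeout = Duration.standardSeconds(30)
//   )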