feat(sink): implement snowflake sink #15429

xzhseh · 2024-03-04T22:55:18Z

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Detailed spec will be added when finalizing this PR.

The general sink logic for snowflake-sink is a little different from the others. Snowflake only supports sinking (e.g., via the insertFiles REST API) from three external stage storages (i.e., aws, azure, and gcp), so we must first upload the corresponding data files to a user-provided s3 bucket, and then trigger the snowflake pipe to copy from that specific external stage.

To keep everything simple at present, we only support amazon s3 as the external stage storage.

For detailed snowpipe workflow, please refer to: https://docs.snowflake.com/user-guide/data-load-snowpipe-rest-overview.
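A minimal sketch of the second step (triggering the pipe once the file is staged), assuming the endpoint and body shape described in the Snowpipe REST documentation linked above. `PipeConfig`, its fields, and the sample values are hypothetical and are not the sink's actual options; a real request also needs a JWT bearer token in the `Authorization` header, which is omitted here.

```rust
/// Hypothetical settings; field names are illustrative only.
struct PipeConfig {
    /// Snowflake account identifier.
    account: String,
    /// Fully qualified pipe name, i.e. <db>.<schema>.<pipe>.
    pipe: String,
}

/// Build the insertFiles endpoint URL for a given request id.
fn insert_files_url(cfg: &PipeConfig, request_id: &str) -> String {
    format!(
        "https://{}.snowflakecomputing.com/v1/data/pipes/{}/insertFiles?requestId={}",
        cfg.account, cfg.pipe, request_id
    )
}

/// Build the JSON body listing the staged files to load.
/// Assumes the paths contain no characters that need JSON escaping.
fn insert_files_body(paths: &[&str]) -> String {
    let files: Vec<String> = paths
        .iter()
        .map(|p| format!("{{\"path\":\"{}\"}}", p))
        .collect();
    format!("{{\"files\":[{}]}}", files.join(","))
}

fn main() {
    let cfg = PipeConfig {
        account: "my_account".to_string(),
        pipe: "db.schema.my_pipe".to_string(),
    };
    // Step 1 (not shown): upload the data file to the user-provided s3 bucket.
    let staged = ["rw/sink_file_233"];
    // Step 2: ask the pipe to copy the staged file.
    println!("POST {}", insert_files_url(&cfg, "req-1"));
    println!("{}", insert_files_body(&staged));
}
```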

Checklist

I have written necessary rustdoc comments
I have added necessary unit tests and integration tests
I have added test labels as necessary. See details.
I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
All checks passed in ./risedev check (or alias, ./risedev c)
My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)

My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

Detailed spec will be updated in Notion later.

github-actions

license-eye has totally checked 4863 files.

Valid	Invalid	Ignored	Fixed
2101	1	2761	0

Invalid file list:

src/connector/src/sink/snowflake_connector.rs

xxhZs · 2024-03-28T04:21:05Z

Also need to add an example to integration_tests

xzhseh · 2024-04-04T15:22:25Z

Also need to add an example to integration_tests

problem is - we always need an external s3 bucket during any potential use-case (including integration tests), plus a valid snowflake account for the final sink part.

fuyufjh

LGTM.

I think log queue, i.e. buffering data in multiple epochs before commit, is a must-have thing for Snowflake. Let's do it later.
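One possible reading of "buffering data in multiple epochs before commit" can be sketched as below. This is not the sink's implementation: `BufferingWriter`, `commit_every`, and the in-memory `committed_batches` are hypothetical names used only to show rows accumulating across barriers and being flushed once every N checkpoints.

```rust
/// Hypothetical writer that buffers rows across barriers and commits
/// once every `commit_every` checkpoint barriers.
struct BufferingWriter {
    commit_every: u32,
    checkpoints_seen: u32,
    buffer: Vec<String>,
    /// Stand-in for files actually handed to the external stage.
    committed_batches: Vec<Vec<String>>,
}

impl BufferingWriter {
    fn new(commit_every: u32) -> Self {
        assert!(commit_every > 0, "commit_every must be positive");
        Self {
            commit_every,
            checkpoints_seen: 0,
            buffer: Vec::new(),
            committed_batches: Vec::new(),
        }
    }

    fn write_row(&mut self, row: &str) {
        self.buffer.push(row.to_string());
    }

    /// Called at every barrier; only checkpoint barriers count toward a commit.
    fn barrier(&mut self, is_checkpoint: bool) {
        if !is_checkpoint {
            return;
        }
        self.checkpoints_seen += 1;
        if self.checkpoints_seen >= self.commit_every {
            self.checkpoints_seen = 0;
            if !self.buffer.is_empty() {
                // Move the buffered rows out as one batch, leaving an empty buffer.
                self.committed_batches.push(std::mem::take(&mut self.buffer));
            }
        }
    }
}

fn main() {
    let mut writer = BufferingWriter::new(2);
    writer.write_row("{\"id\":1}");
    writer.barrier(true); // first checkpoint: keep buffering
    writer.write_row("{\"id\":2}");
    writer.barrier(false); // non-checkpoint barrier: ignored
    writer.write_row("{\"id\":3}");
    writer.barrier(true); // second checkpoint: commit everything buffered
    println!("batches committed: {}", writer.committed_batches.len());
}
```

Committing less often produces fewer, larger staged files, at the cost of higher latency before the data becomes visible downstream.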

src/connector/src/sink/snowflake.rs

tabVersion

lgtm as the first version. please fix the test also.

xxhZs · 2024-04-09T07:04:44Z

Also need to add an example to integration_tests

problem is - we always need an external s3 bucket during any potential use-case (including integration tests), plus a valid snowflake account for the final sink part.

Ok ok, that can be done without integrating it into our ci, just as an example

wenym1 · 2024-04-09T08:06:39Z

src/connector/src/sink/snowflake.rs

+
+    /// Construct the *unique* file suffix for the sink
+    fn file_suffix(&self) -> String {
+        format!("{}_{}", self.epoch, self.sink_file_suffix)


The sink_file_suffix is local to each parallelism of the snowflake writer, so the same sink_file_suffix may appear across multiple parallelisms. Do we need to ensure uniqueness across multiple parallelisms?

as long as epoch remains different, everything should be fine - actually I'm thinking of only including the epoch as the unique identifier for the s3 intermediate file(s), the sink_file_suffix does not really matter, cc @xxhZs @fuyufjh.

refer: #15429 (comment)

plus, the context for this uniqueness is due to snowflake pipe - which will implicitly refuse the sink request (i.e., from insertFiles) to the intermediate s3 file with the same name, even if the status code being returned is 200.

But different parallelisms share the same epochs. Let's say we are now at epoch 233, and we have two sink parallelisms. The first file name in this epoch is <s3_path>/233-0 for both executors. In this case, will the one written later overwrite the one written first?

Overwriting will not happen, but it's even worse 🫠 - the latter one uploads its data to the same file on external S3, and we trigger the insertFiles request to the snowflake pipe; but since the file name remains the same, snowflake implicitly treats this as redundant, and the new version will not be loaded into the pipe, which means the data in the latter file simply gets lost...🙃

Thus, we need another identifier for the parallel writer(s) scenario - any suggestion?

uuid is fine. To help debug we can use <epoch>-<uuid> as the file name.
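The failure mode discussed in this thread, and the `<epoch>-<uuid>` fix, can be illustrated with a toy model. `MockPipe` is a hypothetical stand-in that only mimics the behavior described above (a repeated file name is silently skipped even though the call "succeeds"); it is not the Snowflake API. The `writer_id` strings stand in for real UUIDs, which in practice could come from something like `uuid::Uuid::new_v4()` in the `uuid` crate.

```rust
use std::collections::HashSet;

/// Toy stand-in for the pipe: a file name that was already submitted
/// is silently skipped, as described in the thread above.
#[derive(Default)]
struct MockPipe {
    seen: HashSet<String>,
    loaded: usize,
}

impl MockPipe {
    /// Never reports an error, mirroring the 200 status mentioned above.
    fn insert_file(&mut self, name: &str) {
        if self.seen.insert(name.to_string()) {
            self.loaded += 1;
        }
    }
}

/// Epoch plus a per-writer sequence number: identical across parallel writers.
fn epoch_only(epoch: u64, seq: u32) -> String {
    format!("{}-{}", epoch, seq)
}

/// `<epoch>-<unique id>` naming; `writer_id` stands in for a UUID.
fn epoch_with_id(epoch: u64, writer_id: &str) -> String {
    format!("{}-{}", epoch, writer_id)
}

fn main() {
    // Two parallel writers, same epoch, same local sequence number.
    let mut pipe = MockPipe::default();
    pipe.insert_file(&epoch_only(233, 0)); // writer A
    pipe.insert_file(&epoch_only(233, 0)); // writer B: same name, skipped
    println!("epoch-only names: {} of 2 files loaded", pipe.loaded);

    // Same scenario with a unique id appended to the epoch.
    let mut pipe = MockPipe::default();
    pipe.insert_file(&epoch_with_id(233, "writer-a-id"));
    pipe.insert_file(&epoch_with_id(233, "writer-b-id"));
    println!("epoch + id names: {} of 2 files loaded", pipe.loaded);
}
```

Keeping the epoch as a prefix preserves the debugging benefit mentioned above: files can still be grouped by epoch when inspecting the bucket.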

wenym1 · 2024-04-09T08:08:03Z

src/connector/src/sink/snowflake.rs

+            return Ok(());
+        }
+        // first sink to the external stage provided by user (i.e., s3)
+        self.s3_client


Can we use streaming upload instead of buffering the data by ourselves? We can use the streaming upload of opendal or the streaming upload implemented by ourselves with the aws sdk.

yep I think streaming upload is possible - a possible implementation would probably be something like this: https://gist.github.com/ivormetcalf/f2b8e6abfece4328c86ad1ee34363caf

Actually there is no need to reimplement it again. In our object store crate, we have implemented streaming upload for both aws s3 sdk and opendal. For simplicity we can use opendal. You may see the implementation in the following code.

risingwave/src/object_store/src/object/opendal_engine/opendal_object_store.rs

Line 87 in 5fe4222

async fn streaming_upload(&self, path: &str) -> ObjectResult<BoxedStreamingUploader> {

risingwave/src/object_store/src/object/s3.rs

Line 340 in 5fe4222

async fn streaming_upload(&self, path: &str) -> ObjectResult<BoxedStreamingUploader> {
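To make the difference from buffering the whole file concrete, here is a synchronous, in-memory sketch of the streaming idea. The `StreamingUploader` trait and `MockMultipartUploader` below are hypothetical and simplified; they are not the object store crate's actual interface (which, per the signatures quoted above, is async and returns a `BoxedStreamingUploader`). The point is only that each part is handed off as soon as it fills, instead of accumulating all rows first.

```rust
/// Hypothetical uploader interface, simplified for illustration.
trait StreamingUploader {
    fn write_bytes(&mut self, data: &[u8]);
    fn finish(self) -> Vec<Vec<u8>>;
}

/// In-memory mock that cuts the incoming stream into fixed-size parts.
/// A real uploader would send each part over the network instead of keeping it.
struct MockMultipartUploader {
    part_size: usize,
    current: Vec<u8>,
    parts: Vec<Vec<u8>>,
}

impl MockMultipartUploader {
    fn new(part_size: usize) -> Self {
        assert!(part_size > 0, "part_size must be positive");
        Self { part_size, current: Vec::new(), parts: Vec::new() }
    }
}

impl StreamingUploader for MockMultipartUploader {
    fn write_bytes(&mut self, data: &[u8]) {
        self.current.extend_from_slice(data);
        // Hand off every full part right away so the pending buffer stays small.
        while self.current.len() >= self.part_size {
            let rest = self.current.split_off(self.part_size);
            let full = std::mem::replace(&mut self.current, rest);
            self.parts.push(full);
        }
    }

    fn finish(mut self) -> Vec<Vec<u8>> {
        // Flush whatever is left as the final, possibly shorter, part.
        if !self.current.is_empty() {
            self.parts.push(std::mem::take(&mut self.current));
        }
        self.parts
    }
}

fn main() {
    let mut uploader = MockMultipartUploader::new(8);
    for row in ["{\"id\":1}\n", "{\"id\":2}\n", "{\"id\":3}\n"] {
        uploader.write_bytes(row.as_bytes());
    }
    let parts = uploader.finish();
    println!("uploaded {} parts", parts.len());
}
```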

wenym1 · 2024-04-09T08:08:47Z

src/connector/src/sink/mod.rs

@@ -32,6 +32,8 @@ pub mod nats;
 pub mod pulsar;
 pub mod redis;
 pub mod remote;
+pub mod snowflake;


May have a separate folder snowflake to hold the files related to snowflake.

currently we only have snowflake.rs for the core sinking logic, plus snowflake_connector.rs for the helper clients (i.e., rest api client, s3 client) implementations - let's keep it simple at present, and move things around when it gets bigger in the future.

wenym1 · 2024-04-09T08:11:52Z

src/connector/src/sink/snowflake_connector.rs

+const S3_INTERMEDIATE_FILE_NAME: &str = "RW_SNOWFLAKE_S3_SINK_FILE";
+
+/// The helper function to generate the s3 file name
+fn generate_s3_file_name(s3_path: Option<String>, suffix: String) -> String {


Seems that we implemented the functionality of writing to snowflake from scratch. If there is no existing implementation in another repo, in the future we may move the logic here to a separate repo dedicated to snowflake, so that users in the snowflake community can reuse our implementation.

wenym1 · 2024-04-09T08:14:04Z

src/connector/src/sink/snowflake.rs

+    }
+
+    async fn barrier(&mut self, _is_checkpoint: bool) -> Result<Self::CommitMetadata> {
+        Ok(())


Yes. We can follow a similar implementation to the one for iceberg in #15634

fuyufjh · 2024-04-09T08:29:44Z

Also need to add an example to integration_tests

problem is - we always need an external s3 bucket during any potential use-case (including integration tests), plus a valid snowflake account for the final sink part.

Ok ok, that can be done without integrating it into our ci, just as an example

Agree. It's hard to make it work in CI because Snowflake is a SaaS service. An example is good enough in this case.

xzhseh · 2024-04-10T00:10:17Z

plus, after some investigation, rusoto_s3 seems a perfect match for snowflake sink's use case (and even part of our file sink) - if it looks good, I can refactor the current hand-made S3Client in subsequent pr(s).

cc @fuyufjh @xxhZs.

xzhseh · 2024-04-10T00:38:11Z

Also need to add an example to integration_tests

problem is - we always need an external s3 bucket during any potential use-case (including integration tests), plus a valid snowflake account for the final sink part.

Ok ok, that can be done without integrating it into our ci, just as an example

I'll add a detailed spec / example afterwards, any suggestion on where to put it? (e.g., integration_tests/snowflake_-sink/...?)

basic structure for snowflake sink

82c03f9

xzhseh self-assigned this Mar 4, 2024

github-actions bot added the type/feature label Mar 4, 2024

xzhseh added 4 commits March 5, 2024 20:03

fix format

03c5aa0

Merge branch 'main' into xzhseh/snowflake-sink

4f7cac9

update snowflake common

e9da466

add snowflake_connector.rs

805b6a2

github-actions bot reviewed Mar 7, 2024

src/connector/src/sink/snowflake_connector.rs

xzhseh added 5 commits March 7, 2024 19:09

add snowflake inserter (and builder)

f0657e2

update license

b6bdd34

add snowflake http client

827a1d9

update fmt

e19c3ea

remove redundant import

53ef2c5

neverchanje self-requested a review March 8, 2024 06:33

xzhseh added 13 commits March 8, 2024 20:04

add jwt_token auto-generation

25d3aef

add SnowflakeS3Client

fb0cade

update SnowflakeSinkWriter

cd4168c

set three SinkWriter functions to return Ok

7a9fdf9

add log sinker

db090a9

basic sink funtionality with json encoder

95310bf

add comments && update sink_to_s3

e46b51c

add file num to send_request

cd6f587

fix typo

4caa11f

add aws credentials to prevent load_from_env

00d548d

enable basic snowflake sink pipeline

805c44f

improve format

5b26ccd

update comment

8bc0bf4

xzhseh marked this pull request as ready for review March 12, 2024 20:40

xzhseh requested a review from a team as a code owner March 12, 2024 20:40

xzhseh requested a review from fuyufjh March 12, 2024 20:40

update validate to ensure append-only

9ceab04

xzhseh force-pushed the xzhseh/snowflake-sink branch from e4bebb1 to 9ceab04 Compare March 18, 2024 19:16

xzhseh added 2 commits March 18, 2024 16:13

support s3_path for configuration

af53821

udpate fmt

d877b14

xzhseh added 2 commits April 4, 2024 11:24

update comments

dbc468a

add reference to snowpipe rest api

426cb61

xzhseh requested a review from xxhZs April 8, 2024 15:39

fuyufjh self-requested a review April 9, 2024 03:59

fuyufjh approved these changes Apr 9, 2024

src/connector/src/sink/snowflake.rs

tabVersion approved these changes Apr 9, 2024

wenym1 reviewed Apr 9, 2024

xzhseh added 2 commits April 9, 2024 11:48

update error msg & comments

4b257d5

Merge branch 'main' into xzhseh/snowflake-sink

6373fdc

update with_options_sink accordingly

583964a

use uuid to ensure the global uniqueness of file suffix

9e52dc2

xzhseh added this pull request to the merge queue Apr 10, 2024

Merged via the queue into main with commit 254ad0c Apr 10, 2024
27 of 28 checks passed

xzhseh deleted the xzhseh/snowflake-sink branch April 10, 2024 17:18

xzhseh mentioned this pull request Apr 10, 2024

feat(snowflake-sink): add example use case & detailed spec; fix a subtle problem regarding file_suffix #16241

Merged

9 tasks

BugenZhao mentioned this pull request Apr 11, 2024

refactor(connector): replace hyper client implementation with reqwest #16146

Merged

4 tasks

xzhseh mentioned this pull request Apr 11, 2024

feat(snowflake-sink): change to streaming upload instead of batched bulk load #16269

Merged

9 tasks

wcy-fdu pushed a commit that referenced this pull request Apr 15, 2024

feat(sink): implement snowflake sink (#15429)

924ac78

BugenZhao mentioned this pull request May 21, 2024

risingwave 1.9.1 risingwavelabs/homebrew-risingwave#38

Closed

1 task
