Release/0.3.0 #61

Merged
merged 9 commits into from
Nov 26, 2024

@istreeter istreeter commented Nov 26, 2024

Jira ref: PDP-1551

In common-streams 0.8.x we moved alerting, retrying and webhooks out of
the applications and into the common library.  The library also adds new
features, such as heartbeat webhooks that start when the loader first
becomes healthy.

This commit also makes the webhook alert messages more human-friendly.

Compared to common-streams 0.8.0-M2, this version adds:

- Re-implemented Kinesis source without fs2-kinesis
- Pubsub source opens more transport channels when necessary
- Changes default webhook heartbeat period to 5 minutes
- Http4s Client with configuration appropriate for common-streams apps
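
The heartbeat behavior described above (no heartbeats until the loader first becomes healthy, then one every 5 minutes by default) can be pictured with a small Python sketch. This is not the common-streams implementation: the class `HeartbeatSchedule`, its method names, and the choice to send the first heartbeat immediately on becoming healthy are all assumptions made for illustration. It also does not model what happens if the loader later becomes unhealthy.

```python
from typing import Optional


class HeartbeatSchedule:
    """Hypothetical helper: decides when a heartbeat webhook is due."""

    def __init__(self, period_seconds: float = 300.0):
        # 300 seconds mirrors the 5 minute default period mentioned above.
        self.period_seconds = period_seconds
        # None means the loader has never been healthy, so nothing is due.
        self._next_due: Optional[float] = None

    def mark_healthy(self, now: float) -> None:
        # Only the first healthy signal starts the schedule (assumption).
        if self._next_due is None:
            self._next_due = now

    def due(self, now: float) -> bool:
        # Returns True when a heartbeat should be sent, and schedules the next.
        if self._next_due is None or now < self._next_due:
            return False
        self._next_due = now + self.period_seconds
        return True
```

A caller would invoke `mark_healthy` once the first successful write completes and poll `due` with a monotonic clock value.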

Other changes to common-streams are not relevant to the snowflake loader,
so they are not mentioned here.

Since the 0.2.4 release, the application logs have contained many extra
log lines, e.g.:

```
Oct 21, 2024 8:19:50 PM net.snowflake.client.jdbc.cloud.storage.SnowflakeS3Client upload
INFO: Starting upload from stream (byte stream) to S3 location: <redacted>/streaming_ingest/2024/10/21/20/19/<redacted>_1003_24_0.bdec
```

This fix makes the Snowflake SDK use SLF4J as its logger, so it only
logs at WARN or above by default.
See snowplow-incubator/common-streams#97 for details.

This fixes an edge-case problem in which the loader did not properly
load contexts if they were not JSON objects.

These changes allow the loader to better utilize all the CPU available
on a larger instance.

**1. CPU-intensive parsing/transforming is now parallelized**.
Parallelism is configured by a new config parameter
`cpuParallelismFraction`. The actual parallelism is chosen dynamically
based on the number of available CPUs, so the default value should be
appropriate for VMs of all sizes.
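
As an illustration of how a fraction-based setting like `cpuParallelismFraction` can translate into a concrete worker count, here is a minimal Python sketch. It is not the loader's actual implementation: the function name `cpu_parallelism`, the rounding-up rule, the floor of one worker, and the example fraction 0.75 are assumptions chosen for illustration.

```python
import math
import os
from typing import Optional


def cpu_parallelism(fraction: float, available_cpus: Optional[int] = None) -> int:
    """Hypothetical: derive a worker count from a fraction of available CPUs."""
    # Fall back to the CPU count reported by the OS when none is given.
    cpus = available_cpus if available_cpus is not None else (os.cpu_count() or 1)
    # Round up, and never drop below one worker, so small pods still progress
    # (both choices are assumptions, not taken from the loader).
    return max(1, math.ceil(cpus * fraction))


# Example with an assumed fraction of 0.75:
for cpus in (1, 2, 8):
    print(cpus, "CPUs ->", cpu_parallelism(0.75, cpus), "workers")
```

Because the result scales with the CPU count, the same configured fraction can be reused across differently sized VMs.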

**2. We now open a new Snowflake ingest client per channel**. Note the
Snowflake SDK recommends re-using a single Client per VM and opening
multiple Channels on the same Client.  So here we are going against the
recommendations.  But we justify it because it gives the loader better
visibility of when the client's Future completes, signifying a complete
write to Snowflake.

**3. Upload parallelism chosen dynamically**. Larger VMs benefit from
higher upload parallelism, in order to keep up with the faster rate of
batches produced by the CPU-intensive tasks. Parallelism is configured
by a new parameter `uploadParallelismFactor`, which gets multiplied by
the number of available CPUs. The default value should be appropriate
for VMs of all sizes.
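
To make the multiplication concrete, the following Python sketch computes an upload parallelism from a factor and a CPU count, then uses it to bound concurrent uploads. It is only a sketch under stated assumptions: the names `upload_parallelism` and `upload_all`, the rounding-up rule, the floor of one, and the example factor 2.5 are hypothetical, and a thread pool stands in for whatever concurrency mechanism the loader really uses.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def upload_parallelism(factor: float, available_cpus: int) -> int:
    """Hypothetical: factor multiplied by available CPUs, at least 1."""
    return max(1, math.ceil(available_cpus * factor))


def upload_all(
    batches: Iterable[A],
    upload: Callable[[A], B],
    factor: float,
    available_cpus: int,
) -> List[B]:
    """Run `upload` over all batches with bounded concurrency."""
    workers = upload_parallelism(factor, available_cpus)
    # The pool size caps how many uploads are in flight at once.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order in its results.
        return list(pool.map(upload, batches))
```

With an assumed factor of 2.5, a 4 CPU pod would allow up to 10 uploads in flight, while a 1 CPU pod would allow 3.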

These new settings have been tested on pods ranging from 0.6 to 8
available CPUs.

* Bump common-streams to 0.9.0

See snowplow-incubator/common-streams#99 for the relevant change

This library upgrade brings improvements to the Kinesis source, which
should help on vertically larger instances.

* Bump snowflake-ingest-sdk to 3.0.0

@istreeter istreeter merged commit 1f47b05 into main Nov 26, 2024
3 checks passed
@istreeter istreeter deleted the release/0.3.0 branch November 26, 2024 16:53