
Update data ingestion overview about batch #55

Merged · 16 commits · Nov 25, 2024
29 changes: 29 additions & 0 deletions delivery/overview.mdx
@@ -124,3 +124,32 @@ WITH (
<Note>
File sink currently supports only append-only mode, so please make the sink `append-only` and specify this explicitly after the `FORMAT ... ENCODE ...` statement.
</Note>
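
For instance, the append-only requirement can be declared directly in the sink definition. The sketch below assumes an S3 file sink with Parquet encode; the connector parameters are illustrative placeholders, and the exact option for forcing append-only output may differ, so consult the file sink connector reference.

```sql
-- Minimal sketch of an append-only file sink; parameter values are placeholders.
CREATE SINK my_file_sink AS
SELECT * FROM my_materialized_view
WITH (
    connector = 's3',                  -- assumed target; other file sinks are configured similarly
    s3.bucket_name = 'example-bucket',
    s3.path = 'path/'
    -- region and credentials omitted for brevity
) FORMAT PLAIN ENCODE PARQUET (
    force_append_only = 'true'         -- assumed option name for declaring append-only output
);
```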

## Batching strategy for file sink

RisingWave implements batching strategies for file sinks to simplify file management and avoid generating large numbers of small files. Batching is available for the Parquet, JSON, and CSV encode formats.

### Categories

There are two primary batching strategies:

- **Batching based on row count**:
RisingWave checks whether the maximum row count threshold has been reached after each chunk is written (`sink_writer.write_batch()`). If the threshold is met, the current file is finalized.

- **Batching based on rollover interval**:
RisingWave checks whether the time threshold has been reached each time a chunk is about to be written (`sink_writer.write_batch()`) and each time a barrier is encountered (`sink_writer.barrier()`).

<Note>The batching conditions are relatively coarse-grained. The actual number of rows or the exact timing of file completion may deviate from the specified thresholds, as the checks are intentionally flexible to prioritize efficient file management.</Note>
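
For example, the thresholds can be set in the sink's `WITH` clause. This is a sketch only: the option names `max_row_count` and `rollover_seconds` are assumptions used for illustration, so check the file sink connector reference for the exact parameter names and their defaults.

```sql
-- Illustrative sketch; the batching option names below are assumptions.
CREATE SINK batched_file_sink AS
SELECT * FROM my_materialized_view
WITH (
    connector = 's3',
    s3.bucket_name = 'example-bucket',
    s3.path = 'path/',
    max_row_count = '10000',           -- assumed name: finish a file after roughly this many rows
    rollover_seconds = '10'            -- assumed name: or after this many seconds have elapsed
    -- region and credentials omitted for brevity
) FORMAT PLAIN ENCODE PARQUET (force_append_only = 'true');
```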

If no conditions for batch collection are set, RisingWave will apply a default batching strategy to ensure proper file writing and data consistency.
Reviewer comment: Better to indicate what the default policy is (IIRC it is 10 seconds; please check the original PR).
### File naming rule

Previously, the naming convention for files was `executor_id + epoch.suffix`. With the decoupling of barriers and file writing, the epoch is no longer needed in the file name. However, the `executor_id` is still required, because RisingWave does not merge files written by different parallel executors.

The current file naming rule is `/Option<partition_by>/executor_id + timestamp.suffix`, where the timestamp differentiates files batched by the rollover interval. For example, the output files look like the following:

```
path/2024-09-20/47244640257_1727072046.parquet
path/2024-09-20/47244640257_1727072055.parquet
```
1 change: 1 addition & 0 deletions mint.json
@@ -167,6 +167,7 @@
{"source": "/docs/current/architecture", "destination": "/reference/architecture"},
{"source": "/docs/current/fault-tolerance", "destination": "/reference/fault-tolerance"},
{"source": "/docs/current/limitations", "destination": "/reference/limitations"},
{"source": "/docs/current/sources", "destination": "/integrations/sources/overview"},
{"source": "/docs/current/sql-alter-connection", "destination": "/sql/commands/sql-alter-connection"},
{"source": "/docs/current/sql-alter-database", "destination": "/sql/commands/sql-alter-database"},
{"source": "/docs/current/sql-alter-function", "destination": "/sql/commands/sql-alter-function"},