
Update data ingestion overview about batch #55

Merged · 16 commits · Nov 25, 2024
29 changes: 29 additions & 0 deletions delivery/overview.mdx
@@ -124,3 +124,32 @@ WITH (
<Note>
File sink currently supports only append-only mode, so please make the sink `append-only` and specify this explicitly after the `FORMAT ... ENCODE ...` statement.
</Note>
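
For instance, the append-only requirement can be declared directly in the sink definition. The sketch below assumes an S3 file sink with Parquet encode; the connector parameters are illustrative placeholders, and the exact option for forcing append-only output may differ, so consult the file sink connector reference.

```sql
-- Minimal sketch of an append-only file sink; parameter values are placeholders.
CREATE SINK my_file_sink AS
SELECT * FROM my_materialized_view
WITH (
    connector = 's3',                  -- assumed target; other file sinks are configured similarly
    s3.bucket_name = 'example-bucket',
    s3.path = 'path/'
    -- region and credentials omitted for brevity
) FORMAT PLAIN ENCODE PARQUET (
    force_append_only = 'true'         -- assumed option name for declaring append-only output
);
```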

## Batching strategy for file sink

RisingWave implements batching strategies for file sinks to simplify file management and avoid generating large numbers of small files. Batching is available for the Parquet, JSON, and CSV encode formats.

### Categories

There are two primary batching strategies:

- **Batching based on row count**:
RisingWave checks whether the maximum row count threshold has been reached after each chunk is written (`sink_writer.write_batch()`). If the threshold is met, the current file is finalized.

- **Batching based on rollover interval**:
RisingWave checks whether the time threshold has been reached each time a chunk is about to be written (`sink_writer.write_batch()`) and each time a barrier is encountered (`sink_writer.barrier()`).

<Note>The batching conditions are relatively coarse-grained. The actual number of rows or the exact timing of file completion may deviate from the specified thresholds, as the checks are intentionally flexible to prioritize efficient file management.</Note>
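
For example, the thresholds can be set in the sink's `WITH` clause. This is a sketch only: the option names `max_row_count` and `rollover_seconds` are assumptions used for illustration, so check the file sink connector reference for the exact parameter names and their defaults.

```sql
-- Illustrative sketch; the batching option names below are assumptions.
CREATE SINK batched_file_sink AS
SELECT * FROM my_materialized_view
WITH (
    connector = 's3',
    s3.bucket_name = 'example-bucket',
    s3.path = 'path/',
    max_row_count = '10000',           -- assumed name: finish a file after roughly this many rows
    rollover_seconds = '10'            -- assumed name: or after this many seconds have elapsed
    -- region and credentials omitted for brevity
) FORMAT PLAIN ENCODE PARQUET (force_append_only = 'true');
```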

If no conditions for batch collection are set, RisingWave will apply a default batching strategy to ensure proper file writing and data consistency.
Reviewer comment: Better to indicate what the default policy is (IIRC it is 10 seconds; please check the original PR).
### File naming rule

Previously, the naming convention for files was `executor_id + epoch.suffix`. With the decoupling of barriers and file writing, the epoch is no longer needed in the file name. However, the `executor_id` is still required, because RisingWave does not merge files written by different parallel executors.

The current file naming rule is `/Option<partition_by>/executor_id + timestamp.suffix`, where the timestamp differentiates files batched by the rollover interval. For example, the output files look like the following:

```
path/2024-09-20/47244640257_1727072046.parquet
path/2024-09-20/47244640257_1727072055.parquet
```
1 change: 1 addition & 0 deletions mint.json
@@ -167,6 +167,7 @@
{"source": "/docs/current/architecture", "destination": "/reference/architecture"},
{"source": "/docs/current/fault-tolerance", "destination": "/reference/fault-tolerance"},
{"source": "/docs/current/limitations", "destination": "/reference/limitations"},
{"source": "/docs/current/sources", "destination": "/integrations/sources/overview"},
{"source": "/docs/current/sql-alter-connection", "destination": "/sql/commands/sql-alter-connection"},
{"source": "/docs/current/sql-alter-database", "destination": "/sql/commands/sql-alter-database"},
{"source": "/docs/current/sql-alter-function", "destination": "/sql/commands/sql-alter-function"},