From 128cf783fba26ec36e329ca880f04094524ab68d Mon Sep 17 00:00:00 2001
From: Xupeng Li <162375861+xupengli-db@users.noreply.github.com>
Date: Mon, 22 Apr 2024 13:45:31 -0700
Subject: [PATCH] Add vacuum inventory table related description (#2918)

#### Which Delta project/connector is this regarding?

- [X] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

We have added [a new feature](https://github.com/delta-io/delta/commit/7d41fb7bbf63af33ad228007dd6ba3800b4efe81) for the VACUUM command that allows users to provide an inventory table to specify the files to be considered by VACUUM. This PR updates the documentation to reflect this feature.

## How was this patch tested?

N/A. Doc updates only.

## Does this PR introduce _any_ user-facing changes?

No
---
 docs/source/delta-utility.md | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/docs/source/delta-utility.md b/docs/source/delta-utility.md
index 175e7c74119..0958cb8fe31 100644
--- a/docs/source/delta-utility.md
+++ b/docs/source/delta-utility.md
@@ -40,6 +40,10 @@ default retention threshold for the files is 7 days. To change this behavior, se
   VACUUM delta.`/data/events/` RETAIN 100 HOURS -- vacuum files not required by versions more than 100 hours old
 
   VACUUM eventsTable DRY RUN -- do dry run to get the list of files to be deleted
+
+  VACUUM eventsTable USING INVENTORY inventoryTable -- vacuum files based on a provided reservoir of files as a Delta table
+
+  VACUUM eventsTable USING INVENTORY (select * from inventoryTable) -- vacuum files based on a provided reservoir of files as a Spark SQL query
   ```
 
 See [_](delta-batch.md#sql-support) for the steps to enable support for SQL commands.
@@ -82,7 +86,6 @@ default retention threshold for the files is 7 days. To change this behavior, se
   deltaTable.vacuum(100); // vacuum files not required by versions more than 100 hours old
   ```
 
-
 .. note::
   When using `VACUUM`, to configure Spark to delete files in parallel (based on the number of shuffle partitions) set the session configuration `"spark.databricks.delta.vacuum.parallelDelete.enabled"` to `"true"` .
 
@@ -96,6 +99,17 @@ this table that take longer than the retention interval you plan to specify,
 you can turn off this safety check by setting the Spark configuration property
 `spark.databricks.delta.retentionDurationCheck.enabled` to `false`.
 
+#### Inventory Table
+
+An inventory table contains a list of file paths together with their size, type (directory or not), and last modification time. When the `INVENTORY` option is provided, `VACUUM` considers only the files listed there instead of listing the entire table directory, which can be time-consuming for very large tables. The inventory can be specified as a Delta table or as a Spark SQL query that returns the expected schema. The schema should be as follows:
+
+| Column Name      | Type    | Description                               |
+| ---------------- | ------- | ----------------------------------------- |
+| path             | string  | fully qualified URI                       |
+| length           | integer | size in bytes                             |
+| isDir            | boolean | whether the path is a directory           |
+| modificationTime | integer | file update time in milliseconds          |
+
 ## Retrieve Delta table history