Add vacuum inventory table related description (delta-io#2918)
#### Which Delta project/connector is this regarding?

- [X] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

We have added [a new
feature](delta-io@7d41fb7)
to the VACUUM command that allows users to provide an inventory table
specifying the files to be considered by VACUUM. This PR updates the
documentation to reflect this feature.

## How was this patch tested?

N/A. Doc updates only.

## Does this PR introduce _any_ user-facing changes?

No
xupengli-db authored Apr 22, 2024
1 parent e75e4f9 commit 128cf78
Showing 1 changed file with 15 additions and 1 deletion.
16 changes: 15 additions & 1 deletion docs/source/delta-utility.md
@@ -40,6 +40,10 @@ default retention threshold for the files is 7 days. To change this behavior, se
VACUUM delta.`/data/events/` RETAIN 100 HOURS -- vacuum files not required by versions more than 100 hours old

VACUUM eventsTable DRY RUN -- do dry run to get the list of files to be deleted

VACUUM eventsTable USING INVENTORY inventoryTable -- vacuum files based on a provided reservoir of files as a delta table

VACUUM eventsTable USING INVENTORY (select * from inventoryTable) -- vacuum files based on a provided reservoir of files as spark SQL query
```

See [_](delta-batch.md#sql-support) for the steps to enable support for SQL commands.
@@ -82,7 +86,6 @@ default retention threshold for the files is 7 days. To change this behavior, se
deltaTable.vacuum(100); // vacuum files not required by versions more than 100 hours old
```


.. note::
When using `VACUUM`, to configure Spark to delete files in parallel (based on the number of shuffle partitions), set the session configuration `"spark.databricks.delta.vacuum.parallelDelete.enabled"` to `"true"`.
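For example, parallel deletion can be enabled for the current session from SQL (a sketch using the standard Spark SQL `SET` command; the same key can also be set through `spark.conf.set`):

```sql
-- Delete vacuumed files in parallel, one task per shuffle partition.
SET spark.databricks.delta.vacuum.parallelDelete.enabled = true;
```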

@@ -96,6 +99,17 @@ this table that take longer than the retention interval you plan to specify,
you can turn off this safety check by setting the Spark configuration property
`spark.databricks.delta.retentionDurationCheck.enabled` to `false`.
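For example, the safety check can be disabled for the current session before running a short-retention `VACUUM` (a sketch using the standard Spark SQL `SET` command; `eventsTable` is illustrative):

```sql
-- Disable the retention duration safety check for this session only.
SET spark.databricks.delta.retentionDurationCheck.enabled = false;

-- Now permitted; irreversibly removes all files not needed by the latest version.
VACUUM eventsTable RETAIN 0 HOURS
```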

#### Inventory Table

An inventory table contains a list of file paths together with their size, type (directory or not), and last modification time. When the INVENTORY option is provided, VACUUM considers the files listed there instead of performing a full listing of the table directory, which can be time-consuming for very large tables. The inventory can be specified either as a Delta table or as a Spark SQL query whose result has the expected schema. The schema should be as follows:

| Column Name      | Type    | Description                             |
| ---------------- | ------- | --------------------------------------- |
| path             | string  | fully qualified URI                     |
| length           | long    | size in bytes                           |
| isDir            | boolean | boolean indicating if it is a directory |
| modificationTime | long    | last modification time in milliseconds  |
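A hypothetical end-to-end sketch (the names `file_inventory` and `eventsTable` are illustrative; in practice the inventory would typically be populated from a storage-level listing such as a cloud provider's inventory report):

```sql
-- Create an inventory table matching the documented schema.
CREATE TABLE file_inventory (
  path STRING,              -- fully qualified URI
  length BIGINT,            -- size in bytes
  isDir BOOLEAN,            -- whether the path is a directory
  modificationTime BIGINT   -- last modification time in milliseconds
) USING DELTA;

-- VACUUM then considers only the files listed in the inventory.
VACUUM eventsTable USING INVENTORY file_inventory;

-- Equivalently, pass a query that returns the same schema.
VACUUM eventsTable USING INVENTORY (SELECT path, length, isDir, modificationTime FROM file_inventory);
```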

<a id="delta-history"></a>

## Retrieve Delta table history
