diff --git a/src/pages/latest/delta-apidoc.mdx b/src/pages/latest/delta-apidoc.mdx
index 1b639bb..a801118 100644
--- a/src/pages/latest/delta-apidoc.mdx
+++ b/src/pages/latest/delta-apidoc.mdx
@@ -6,13 +6,13 @@ menu: docs
 
 For most read and write operations on Delta tables, you can use Apache Spark reader and writer APIs. For examples, see [Table batch reads and writes](/latest/delta-batch) and [Table streaming reads and writes](/latest/delta-streaming).
 
-However, there are some operations that are specific to DeltaLake and you must use DeltaLake APIs. For examples, see [Table utility commands](/latest/delta-utility).
+However, there are some operations that are specific to Delta Lake and you must use Delta Lake APIs. For examples, see [Table utility commands](/latest/delta-utility).
 
-  Some DeltaLake APIs are still evolving and are indicated with the **Evolving**
+  Some Delta Lake APIs are still evolving and are indicated with the **Evolving**
   qualifier in the API docs.
 
-- [Scala API docs](/latest/api/scala/io/delta/tables/index.html)
-- [Java API docs](/latest/api/java/index.html)
-- [Python API docs](/latest/api/python/index.html)
+- [Scala API docs](https://docs.delta.io/latest/api/scala/io/delta/tables/index.html)
+- [Java API docs](https://docs.delta.io/latest/api/java/index.html)
+- [Python API docs](https://docs.delta.io/latest/api/python/index.html)
diff --git a/src/pages/latest/delta-change-data-feed.mdx b/src/pages/latest/delta-change-data-feed.mdx
index 94fb89d..01f654b 100644
--- a/src/pages/latest/delta-change-data-feed.mdx
+++ b/src/pages/latest/delta-change-data-feed.mdx
@@ -15,7 +15,7 @@ You can read the change events in batch queries using DataFrame APIs (that is, `
 
 Change Data Feed is not enabled by default. The following use cases should drive when you enable the change data feed.
 
-- **Silver and Gold tables**: Improve Delta performance by processing only row-level changes following initial MERGE, UPDATE, or DELETE operations to accelerate and simplify ETL and ELT operations.
+- **Silver and Gold tables**: Improve Delta performance by processing only row-level changes following initial `MERGE`, `UPDATE`, or `DELETE` operations to accelerate and simplify ETL and ELT operations.
 - **Transmit changes**: Send a change data feed to downstream systems such as Kafka or RDBMS that can use it to incrementally process in later stages of data pipelines.
 - **Audit trail table**: Capture the change data feed as a Delta table provides perpetual storage and efficient query capability to see all changes over time, including when deletes occur and what updates were made.
@@ -216,7 +216,7 @@ In addition to the data columns, change data contains metadata columns that iden
 
 | `_commit_version` | Long | The Delta log or table version containing the change. |
 | `_commit_timestamp` | Timestamp | The timestamp associated when the commit was created. |
 
-**(1)** preimage is the value before the update, postimage is the value after the update.
+**(1)** `preimage` is the value before the update, `postimage` is the value after the update.
 
 ## Frequently asked questions (FAQ)
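As a companion to the change data feed hunks above, here is a minimal, hedged sketch of enabling the feed on an existing table and reading row-level changes in batch with the Python API. The table name `events` and the version range are illustrative assumptions, and the session setup follows the standard delta-spark quick-start configuration.

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Build a Delta-enabled Spark session (standard delta-spark configuration).
builder = (
    SparkSession.builder.appName("cdf-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Enable the change data feed on an existing Delta table (table name is illustrative).
spark.sql("ALTER TABLE events SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

# Read the row-level changes recorded between two table versions as a batch DataFrame.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .option("endingVersion", 10)
    .table("events")
)
changes.select("_change_type", "_commit_version", "_commit_timestamp").show()
```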
diff --git a/src/pages/latest/delta-constraints.mdx b/src/pages/latest/delta-constraints.mdx
index 570b5b4..e2d2490 100644
--- a/src/pages/latest/delta-constraints.mdx
+++ b/src/pages/latest/delta-constraints.mdx
@@ -3,7 +3,7 @@ title: Constraints
 menu: docs
 ---
 
-Delta tables support standard SQL constraint management clauses that ensure that the quality and integrity of data added to a table is automatically verified. When a constraint is violated, Delta Lake throws an InvariantViolationException to signal that the new data can’t be added.
+Delta tables support standard SQL constraint management clauses that ensure that the quality and integrity of data added to a table is automatically verified. When a constraint is violated, Delta Lake throws an `InvariantViolationException` to signal that the new data can’t be added.
 
 Adding a constraint automatically upgrades the table writer protocol version. See [Table protocol versioning](/latest/versioning) to understand table protocol versioning and what it means to upgrade the protocol version.
diff --git a/src/pages/latest/delta-utility.mdx b/src/pages/latest/delta-utility.mdx
index d576d78..87399ff 100644
--- a/src/pages/latest/delta-utility.mdx
+++ b/src/pages/latest/delta-utility.mdx
@@ -1,5 +1,5 @@
 ---
-title: Utility Operations
+title: Table utility commands
 width: full
 menu: docs
 ---
@@ -18,9 +18,9 @@ default retention threshold for the files is 7 days. To change this behavior, se
 
 - `vacuum` removes all files from directories not managed by Delta Lake, ignoring directories beginning with _. If you are storing additional metadata like Structured Streaming checkpoints within a Delta table directory, use a directory name such as `_checkpoints`.
 
-- `vacuum` deletes only data files, not log files. Log files are deleted automatically and asynchronously after checkpoint operations. The default retention period of log files is 30 days, configurable through the `delta.logRetentionDuration` property which you set with the `ALTER TABLE SET TBLPROPERTIES SQL` method. See [Table properties](/latest/delta-batch/#table-properties).
+- `vacuum` deletes only data files, not log files. Log files are deleted automatically and asynchronously after checkpoint operations. The default retention period of log files is 30 days, configurable through the `delta.logRetentionDuration` property which you set with the `ALTER TABLE SET TBLPROPERTIES` SQL method. See [Table properties](/latest/delta-batch/#table-properties).
 
-- The ability to time travel back to a version older than the retention period is lost after running vacuum.
+- The ability to [time travel](/latest/delta-batch#deltatimetravel) back to a version older than the retention period is lost after running `vacuum`.
@@ -78,7 +78,7 @@ When using `VACUUM`, to configure Spark to delete files in parallel (based on th
 
 See the [Delta Lake APIs](/latest/delta-apidoc) for Scala, Java, and Python syntax details.
 
-
+
 
 It is recommended that you set a retention interval to be at least 7 days, because old snapshots and uncommitted files can still be in use by concurrent readers or writers to the table. If `VACUUM` cleans up active files, concurrent readers can fail or, worse, tables can be corrupted when `VACUUM` deletes files that have not yet been committed. You must choose an interval that is longer than the longest running concurrent transaction and the longest period that any stream can lag behind the most recent update to the table.
 
@@ -421,7 +421,7 @@ You can easily convert a Delta table back to a Parquet table using the following
 
 ## Restore a Delta table to an earlier state
 
-You can restore a Delta table to its earlier state by using the `RESTORE` command. A Delta table internally maintains historic versions of the table that enable it to be restored to an earlier state. A version corresponding to the earlier state or a timestamp of when the earlier state was created are supported as options by the RE`STORE command.
+You can restore a Delta table to its earlier state by using the `RESTORE` command. A Delta table internally maintains historic versions of the table that enable it to be restored to an earlier state. A version corresponding to the earlier state or a timestamp of when the earlier state was created are supported as options by the `RESTORE` command.
@@ -476,7 +476,7 @@ deltaTable.restoreToTimestamp("2019-02-14") // restore to a specific timestamp
 
-Restore is considered a data-changing operation. Delta Lake log entries added by the RESTORE command contain dataChange set to true. If there is a downstream application, such as a Structured streaming job that processes the updates to a Delta Lake table, the data change log entries added by the restore operation are considered as new data updates, and processing them may result in duplicate data.
+Restore is considered a data-changing operation. Delta Lake log entries added by the `RESTORE` command contain [dataChange](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#add-file-and-remove-file) set to true. If there is a downstream application, such as a [Structured streaming](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html) job that processes the updates to a Delta Lake table, the data change log entries added by the restore operation are considered as new data updates, and processing them may result in duplicate data.
 
 For example:
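To illustrate the `VACUUM` and `RESTORE` behavior covered by the delta-utility hunks above, a short sketch with the Python `DeltaTable` API; the table path, retention interval, and target version are placeholders, and a Delta-enabled `spark` session (as in the earlier sketch) is assumed.

```python
from delta.tables import DeltaTable

# Path is a placeholder; use DeltaTable.forName for metastore-registered tables.
deltaTable = DeltaTable.forPath(spark, "/tmp/delta/events")

# Delete data files no longer referenced by the table and older than the default
# 7-day retention threshold (168 hours). Log files are not affected by vacuum.
deltaTable.vacuum(168)

# Roll the table back to an earlier version. Time travel to versions whose data
# files have already been vacuumed is no longer possible.
deltaTable.restoreToVersion(5)
```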
diff --git a/src/pages/latest/versioning.mdx b/src/pages/latest/versioning.mdx
index e6167b0..5586518 100644
--- a/src/pages/latest/versioning.mdx
+++ b/src/pages/latest/versioning.mdx
@@ -9,7 +9,7 @@ menu: docs
 
 The transaction log for a Delta table contains protocol versioning information that supports Delta Lake evolution. Delta Lake tracks minimum [reader and writer
-versions](/latest/delta-utility/#detail-schema) separately.
+versions](/latest/delta-utility#detail-schema) separately.
 
 Delta Lake guarantees _backward compatibility_. A higher protocol version of the
@@ -32,7 +32,7 @@ also set the default protocol versions by setting the SQL configurations:
 
 To upgrade a table to a newer protocol version, use the `DeltaTable.upgradeTableProtocol` method:
 
-
+
 
 Protocol version upgrades are irreversible, and upgrading the protocol version may break the existing Delta Lake table readers, writers, or both. Therefore, we recommend you upgrade specific tables only when needed, such as to opt-in to new features in Delta Lake. You should also check to make sure that all of your current and future production tools support Delta Lake tables with the new protocol version.
 
@@ -45,7 +45,7 @@ ALTER TABLE SET TBLPROPERTIES('delta.minReaderVersion' = '1',
 ```python
 from delta.tables import DeltaTable
 delta = DeltaTable.forPath(spark, "path_to_table") # or DeltaTable.forName
-delta.upgradeTableProtocol(1, 3) # upgrades to readerVersion=1, writerVersion=3
+delta.upgradeTableProtocol(1, 3) # Upgrades to readerVersion=1, writerVersion=3
 ```
 
 ```scala
@@ -69,13 +69,17 @@ delta.upgradeTableProtocol(1, 3) // Upgrades to readerVersion=1, writerVersion=3
 
 See [Requirements for Readers](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#requirements-for-readers) and [Writer Version Requirements](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#writer-version-requirements) in the [delta-io/delta](https://github.com/delta-io/delta) repo on the GitHub website.
 
-### Column mapping
+## Column mapping
 
-[Column mapping feature](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#column-mapping) allows Delta table columns and the underlying Parquet file columns to use different names. This enables Delta schema evolution operations such as [RENAME COLUMN](/latest/delta-batch/#rename-columns) on a Delta table without the need to rewrite the underlying Parquet files. It also allows users to name Delta table columns by using [characters that are not allowed](/latest/delta-batch/#use-special-characters-in-column-names) by Parquet, such as spaces, so that users can directly ingest CSV or JSON data into Delta without the need to rename columns due to previous character constraints.
+
+This feature is available in Delta Lake 1.2.0 and above. This feature is currently experimental with [known limitations](#known-limitations).
+
+
+[Column mapping feature](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#column-mapping) allows Delta table columns and the underlying Parquet file columns to use different names. This enables Delta schema evolution operations such as [RENAME COLUMN](/latest/delta-batch/#rename-columns) and [DROP COLUMNS](/latest/delta-batch#drop-columns) on a Delta table without the need to rewrite the underlying Parquet files. It also allows users to name Delta table columns by using [characters that are not allowed](/latest/delta-batch/#use-special-characters-in-column-names) by Parquet, such as spaces, so that users can directly ingest CSV or JSON data into Delta without the need to rename columns due to previous character constraints.
 
 Column mapping requires upgrading the Delta Lake table protocol.
 
-
+
 
 Protocol version upgrades are irreversible, and upgrading the protocol version may break the existing Delta Lake table readers, writers, or both. Therefore, we recommend you upgrade specific tables only when needed, such as to opt-in to new features in Delta Lake. You should also check to make sure that all of your current and future production tools support Delta Lake tables with the new protocol version.
 
@@ -92,7 +96,7 @@ ALTER TABLE SET TBLPROPERTIES (
 
-## Known limitations
+### Known limitations
 
 - In Delta Lake 2.0.0, [Spark Structured Streaming](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html) and [Change data feed](/latest/delta-change-data-feed) reads are explicitly blocked on a column mapping enabled table.
-- The Delta table protocol specifies two modes of column mapping, by name and by id. Currently in Delta Lake only the name mode is supported.
\ No newline at end of file
+- The Delta table protocol specifies two modes of column mapping, by `name` and by `id`. Currently in Delta Lake only the `name` mode is supported.
\ No newline at end of file
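Tying the versioning and column mapping hunks together, a hedged sketch that first checks a table's current protocol with `DESCRIBE DETAIL` and then enables `name`-mode column mapping, which upgrades the protocol irreversibly. The table and column names are illustrative, and a Delta-enabled `spark` session is assumed.

```python
# Inspect the protocol versions the table currently requires.
spark.sql("DESCRIBE DETAIL events").select("minReaderVersion", "minWriterVersion").show()

# Enable name-based column mapping; this raises the protocol to
# readerVersion=2 / writerVersion=5 and cannot be rolled back.
spark.sql("""
  ALTER TABLE events SET TBLPROPERTIES (
    'delta.minReaderVersion' = '2',
    'delta.minWriterVersion' = '5',
    'delta.columnMapping.mode' = 'name'
  )
""")

# With column mapping enabled, renaming a column does not rewrite the Parquet files.
spark.sql("ALTER TABLE events RENAME COLUMN event_ts TO event_timestamp")
```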