8 changes: 4 additions & 4 deletions docs/docs/spark-getting-started.md
@@ -26,17 +26,17 @@ Spark is currently the most feature-rich compute engine for Iceberg operations.
We recommend you to get started with Spark to understand Iceberg concepts and features with examples.
You can also view documentations of using Iceberg with other compute engine under the [Multi-Engine Support](../../multi-engine-support.md) page.

## Using Iceberg in Spark 3
## Using Iceberg in Spark

To use Iceberg in a Spark shell, use the `--packages` option:

```sh
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }}
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-{{ sparkVersionMajor }}:{{ icebergVersion }}
```

!!! info
<!-- markdown-link-check-disable-next-line -->
If you want to include Iceberg in your Spark installation, add the [`iceberg-spark-runtime-3.5_2.12` Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/{{ icebergVersion }}/iceberg-spark-runtime-3.5_2.12-{{ icebergVersion }}.jar) to Spark's `jars` folder.
If you want to include Iceberg in your Spark installation, add the [`iceberg-spark-runtime-{{ sparkVersionMajor }}` Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-{{ sparkVersionMajor }}/{{ icebergVersion }}/iceberg-spark-runtime-{{ sparkVersionMajor }}-{{ icebergVersion }}.jar) to Spark's `jars` folder.

### Adding catalogs

@@ -45,7 +45,7 @@ Iceberg comes with [catalogs](spark-configuration.md#catalogs) that enable SQL c
This command creates a path-based catalog named `local` for tables under `$PWD/warehouse` and adds support for Iceberg tables to Spark's built-in catalog:

```sh
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }}\
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-{{ sparkVersionMajor }}:{{ icebergVersion }}\
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
--conf spark.sql.catalog.spark_catalog.type=hive \
4 changes: 3 additions & 1 deletion docs/docs/spark-procedures.md
@@ -20,7 +20,9 @@ title: "Procedures"

# Spark Procedures

To use Iceberg in Spark, first configure [Spark catalogs](spark-configuration.md). Stored procedures are only available when using [Iceberg SQL extensions](spark-configuration.md#sql-extensions) in Spark 3.
To use Iceberg in Spark, first configure [Spark catalogs](spark-configuration.md).
For Spark 3.x, stored procedures are only available when using [Iceberg SQL extensions](spark-configuration.md#sql-extensions) in Spark.

Member: Please share a snapshot of this page after the PR.

Contributor (Author): [image]

For Spark 4.0, stored procedures are supported natively without requiring the Iceberg SQL extensions. However, note that they are __case-sensitive__ in Spark 4.0.
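
For illustration, a minimal sketch of a procedure call (assumes an Iceberg-enabled `spark-shell` session with a catalog named `prod`; the table name and snapshot ID are placeholders):

```scala
// In Spark 4.0 the procedure name is case-sensitive, so write it exactly as
// registered (here, lowercase `rollback_to_snapshot`).
spark.sql("CALL prod.system.rollback_to_snapshot('db.sample', 10963874102873)")
```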

## Usage

8 changes: 2 additions & 6 deletions docs/docs/spark-queries.md
@@ -24,7 +24,7 @@ To use Iceberg in Spark, first configure [Spark catalogs](spark-configuration.md

## Querying with SQL

In Spark 3, tables use identifiers that include a [catalog name](spark-configuration.md#using-catalogs).
In Spark, tables use identifiers that include a [catalog name](spark-configuration.md#using-catalogs).

```sql
SELECT * FROM prod.db.table; -- catalog: prod, namespace: db, table: table
@@ -45,7 +45,7 @@ SELECT * FROM prod.db.table.files;
| 0 | s3:/.../table/data/00002-5-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET | 0 | {1999-01-01, 03} | 1 | 597 | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0] | [] | [1 -> , 2 -> a] | [1 -> , 2 -> a] | null | [4] | null | null |

### Time travel Queries with SQL
Spark 3.3 and later supports time travel in SQL queries using `TIMESTAMP AS OF` or `VERSION AS OF` clauses.
Spark supports time travel in SQL queries using `TIMESTAMP AS OF` or `VERSION AS OF` clauses.
The `VERSION AS OF` clause can contain a long snapshot ID or a string branch or tag name.
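
For illustration, a minimal sketch of both clauses (assumes a table `prod.db.table`; the timestamp, snapshot ID, and branch name are placeholders):

```scala
// Read the table as of a wall-clock timestamp
spark.sql("SELECT * FROM prod.db.table TIMESTAMP AS OF '2021-10-01 00:00:00'")

// Read a specific snapshot by its ID
spark.sql("SELECT * FROM prod.db.table VERSION AS OF 10963874102873")

// VERSION AS OF also accepts a branch or tag name
spark.sql("SELECT * FROM prod.db.table VERSION AS OF 'audit-branch'")
```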

!!! info
@@ -180,10 +180,6 @@ spark.read
.load("path/to/table")
```

!!! info
Spark 3.0 and earlier versions do not support using `option` with `table` in DataFrameReader commands. All options will be silently
ignored. Do not use `table` when attempting to time-travel or use other options. See [SPARK-32592](https://issues.apache.org/jira/browse/SPARK-32592).

Comment on lines -183 to -186

Contributor: nit: i think we should keep this warning

### Incremental read

To read appended data incrementally, use:
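
A minimal sketch of such a read (the option names follow Iceberg's documented `start-snapshot-id`/`end-snapshot-id` read options; the snapshot IDs and path are placeholders):

```scala
// Read only the data appended after start-snapshot-id (exclusive)
// up to end-snapshot-id (inclusive)
spark.read
  .format("iceberg")
  .option("start-snapshot-id", "10963874102873")
  .option("end-snapshot-id", "63874143573109")
  .load("path/to/table")
```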
4 changes: 1 addition & 3 deletions docs/docs/spark-structured-streaming.md
@@ -76,8 +76,6 @@ data.writeStream
.toTable("database.table_name")
```

If you're using Spark 3.0 or earlier, you need to use `.option("path", "database.table_name").start()`, instead of `.toTable("database.table_name")`.

Comment on lines -79 to -80

Contributor: nit: we should keep this warning in case someone is still using spark 3.0 or earlier

In the case of the directory-based Hadoop catalog:

```scala
@@ -101,7 +99,7 @@ Iceberg doesn't support experimental [continuous processing](https://spark.apach

### Partitioned table

Iceberg requires sorting data by partition per task prior to writing the data. In Spark tasks are split by Spark partition.
Iceberg requires sorting data by partition per task prior to writing the data. In Spark tasks are split by Spark partition
against partitioned table. For batch queries you're encouraged to do explicit sort to fulfill the requirement
(see [here](spark-writes.md#writing-distribution-modes)), but the approach would bring additional latency as
repartition and sort are considered as heavy operations for streaming workload. To avoid additional latency, you can
34 changes: 17 additions & 17 deletions docs/docs/spark-writes.md
@@ -22,25 +22,25 @@ title: "Writes"

To use Iceberg in Spark, first configure [Spark catalogs](spark-configuration.md).

Some plans are only available when using [Iceberg SQL extensions](spark-configuration.md#sql-extensions) in Spark 3.
Some plans are only available when using [Iceberg SQL extensions](spark-configuration.md#sql-extensions).

Iceberg uses Apache Spark's DataSourceV2 API for data source and catalog implementations. Spark DSv2 is an evolving API with different levels of support in Spark versions:

| Feature support | Spark 3 | Notes |
|--------------------------------------------------|-----------|-----------------------------------------------------------------------------|
| [SQL insert into](#insert-into) | ✔️ | ⚠ Requires `spark.sql.storeAssignmentPolicy=ANSI` (default since Spark 3.0) |
| [SQL merge into](#merge-into) | ✔️ | ⚠ Requires Iceberg Spark extensions |
| [SQL insert overwrite](#insert-overwrite) | ✔️ | ⚠ Requires `spark.sql.storeAssignmentPolicy=ANSI` (default since Spark 3.0) |
| [SQL delete from](#delete-from) | ✔️ | ⚠ Row-level delete requires Iceberg Spark extensions |
| [SQL update](#update) | ✔️ | ⚠ Requires Iceberg Spark extensions |
| [DataFrame append](#appending-data) | ✔️ | |
| [DataFrame overwrite](#overwriting-data) | ✔️ | |
| [DataFrame CTAS and RTAS](#creating-tables) | ✔️ | ⚠ Requires DSv2 API |
| [DataFrame merge into](#merging-data) | ✔️ | ⚠ Requires DSv2 API (Spark 4.0 and later) |
| Feature support | Spark | Notes |
|--------------------------------------------------|---------|-----------------------------------------------------------------------------|
| [SQL insert into](#insert-into) | ✔️ | ⚠ Requires `spark.sql.storeAssignmentPolicy=ANSI` (default since Spark 3.0) |
| [SQL merge into](#merge-into) | ✔️ | ⚠ Requires Iceberg Spark extensions |
| [SQL insert overwrite](#insert-overwrite) | ✔️ | ⚠ Requires `spark.sql.storeAssignmentPolicy=ANSI` (default since Spark 3.0) |
| [SQL delete from](#delete-from) | ✔️ | ⚠ Row-level delete requires Iceberg Spark extensions |
| [SQL update](#update) | ✔️ | ⚠ Requires Iceberg Spark extensions |
| [DataFrame append](#appending-data) | ✔️ | |
| [DataFrame overwrite](#overwriting-data) | ✔️ | |
| [DataFrame CTAS and RTAS](#creating-tables) | ✔️ | ⚠ Requires DSv2 API |
| [DataFrame merge into](#merging-data) | ✔️ | ⚠ Requires DSv2 API (Spark 4.0 and later) |

## Writing with SQL

Spark 3 supports SQL `INSERT INTO`, `MERGE INTO`, and `INSERT OVERWRITE`, as well as the new `DataFrameWriterV2` API.
Spark supports SQL `INSERT INTO`, `MERGE INTO`, and `INSERT OVERWRITE`, as well as the new `DataFrameWriterV2` API.

### `INSERT INTO`

@@ -55,7 +55,7 @@ INSERT INTO prod.db.table SELECT ...

### `MERGE INTO`

Spark 3 added support for `MERGE INTO` queries that can express row-level updates.
Spark supports `MERGE INTO` queries that can express row-level updates.

Iceberg supports `MERGE INTO` by rewriting data files that contain rows that need to be updated in an `overwrite` commit.
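
For illustration, a minimal sketch of a `MERGE INTO` statement (table names, columns, and the source query are placeholders; assumes a session with the Iceberg SQL extensions where they are required):

```scala
spark.sql("""
  MERGE INTO prod.db.target t               -- target table
  USING (SELECT * FROM prod.db.updates) s   -- source of updates
  ON t.id = s.id                            -- match updates to target rows
  WHEN MATCHED AND s.op = 'delete' THEN DELETE
  WHEN MATCHED THEN UPDATE SET t.count = t.count + s.count
  WHEN NOT MATCHED THEN INSERT *
""")
```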

@@ -161,7 +161,7 @@ Note that this mode cannot replace hourly partitions like the dynamic example qu

### `DELETE FROM`

Spark 3 added support for `DELETE FROM` queries to remove data from tables.
Spark supports `DELETE FROM` queries to remove data from tables.

Delete queries accept a filter to match rows to delete.
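
For illustration, a minimal sketch (the table name and filter are placeholders):

```scala
// Remove all rows with a timestamp in May 2020
spark.sql("""
  DELETE FROM prod.db.table
  WHERE ts >= '2020-05-01 00:00:00' AND ts < '2020-06-01 00:00:00'
""")
```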

@@ -253,7 +253,7 @@ data.writeTo("prod.db.table.branch_audit").overwritePartitions()

## Writing with DataFrames

Spark 3 introduced the new `DataFrameWriterV2` API for writing to tables using data frames. The v2 API is recommended for several reasons:
Spark introduced the new `DataFrameWriterV2` API for writing to tables using data frames. The v2 API is recommended for several reasons:

* CTAS, RTAS, and overwrite by filter are supported
* All operations consistently write columns to a table by name
@@ -268,7 +268,7 @@ Spark 3 introduced the new `DataFrameWriterV2` API for writing to tables using d
The v1 DataFrame `write` API is still supported, but is not recommended.

!!! danger
When writing with the v1 DataFrame API in Spark 3, use `saveAsTable` or `insertInto` to load tables with a catalog.
When writing with the v1 DataFrame API in Spark, use `saveAsTable` or `insertInto` to load tables with a catalog.
Using `format("iceberg")` loads an isolated table reference that will not automatically refresh tables used by queries.
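
To illustrate the caveat, a minimal sketch (assumes a DataFrame `data` whose schema matches an existing table `prod.db.table`):

```scala
// v2 API (recommended): resolves the table through the configured catalog
data.writeTo("prod.db.table").append()

// v1 API: prefer saveAsTable or insertInto, which also go through the catalog
data.write.mode("append").saveAsTable("prod.db.table")

// Avoid format("iceberg") with save(): it loads an isolated table reference
// that is not refreshed for other queries in the session
data.write.format("iceberg").mode("append").save("prod.db.table")
```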

### Appending data
12 changes: 6 additions & 6 deletions site/docs/spark-quickstart.md
@@ -274,7 +274,7 @@ This configuration creates a path-based catalog named `local` for tables under `
=== "CLI"

```sh
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }}\
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-{{ sparkVersionMajor }}:{{ icebergVersion }}\
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
--conf spark.sql.catalog.spark_catalog.type=hive \
@@ -287,7 +287,7 @@ This configuration creates a path-based catalog named `local` for tables under `
=== "spark-defaults.conf"

```sh
spark.jars.packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }}
spark.jars.packages org.apache.iceberg:iceberg-spark-runtime-{{ sparkVersionMajor }}:{{ icebergVersion }}
spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.spark_catalog org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type hive
@@ -309,27 +309,27 @@ If you already have a Spark environment, you can add Iceberg, using the `--packa
=== "SparkSQL"

```sh
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }}
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-{{ sparkVersionMajor }}:{{ icebergVersion }}
```

=== "Spark-Shell"

```sh
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }}
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-{{ sparkVersionMajor }}:{{ icebergVersion }}
```

=== "PySpark"

```sh
pyspark --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }}
pyspark --packages org.apache.iceberg:iceberg-spark-runtime-{{ sparkVersionMajor }}:{{ icebergVersion }}
```

!!! note
If you want to include Iceberg in your Spark installation, add the Iceberg Spark runtime to Spark's `jars` folder.
You can download the runtime by visiting to the [Releases](releases.md) page.

<!-- markdown-link-check-disable-next-line -->
[spark-runtime-jar]: https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/{{ icebergVersion }}/iceberg-spark-runtime-3.5_2.12-{{ icebergVersion }}.jar
[spark-runtime-jar]: https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-{{ sparkVersionMajor }}/{{ icebergVersion }}/iceberg-spark-runtime-{{ sparkVersionMajor }}-{{ icebergVersion }}.jar

Contributor: nit: this isnt rendered
same problem on https://iceberg.apache.org/spark-quickstart/#learn-more right now


#### Learn More

1 change: 1 addition & 0 deletions site/mkdocs.yml
@@ -88,6 +88,7 @@ extra:
nessieVersion: '0.104.5'
flinkVersion: '2.0.0'
flinkVersionMajor: '2.0'
sparkVersionMajor: '4.0_2.13'
social:
- icon: fontawesome/regular/comments
link: 'https://iceberg.apache.org/community/'