Doc: Remove Spark 3 specific wordings in docs #14357
base: main
Changes from all commits: eb67961, b0d711a, 606d6f1, 5825e88, 5033945, 6efed89
````diff
@@ -24,7 +24,7 @@ To use Iceberg in Spark, first configure [Spark catalogs](spark-configuration.md
 
 ## Querying with SQL
 
-In Spark 3, tables use identifiers that include a [catalog name](spark-configuration.md#using-catalogs).
+In Spark, tables use identifiers that include a [catalog name](spark-configuration.md#using-catalogs).
 
 ```sql
 SELECT * FROM prod.db.table; -- catalog: prod, namespace: db, table: table
@@ -45,7 +45,7 @@ SELECT * FROM prod.db.table.files;
 | 0 | s3:/.../table/data/00002-5-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET | 0 | {1999-01-01, 03} | 1 | 597 | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0] | [] | [1 -> , 2 -> a] | [1 -> , 2 -> a] | null | [4] | null | null |
 
 ### Time travel Queries with SQL
-Spark 3.3 and later supports time travel in SQL queries using `TIMESTAMP AS OF` or `VERSION AS OF` clauses.
+Spark supports time travel in SQL queries using `TIMESTAMP AS OF` or `VERSION AS OF` clauses.
 The `VERSION AS OF` clause can contain a long snapshot ID or a string branch or tag name.
 
 !!! info
@@ -180,10 +180,6 @@ spark.read
   .load("path/to/table")
 ```
 
-!!! info
-    Spark 3.0 and earlier versions do not support using `option` with `table` in DataFrameReader commands. All options will be silently
-    ignored. Do not use `table` when attempting to time-travel or use other options. See [SPARK-32592](https://issues.apache.org/jira/browse/SPARK-32592).
-
````
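The time-travel clauses touched in the hunks above can be exercised from Scala via `spark.sql`; a minimal sketch, assuming a running SparkSession `spark` with an Iceberg catalog named `prod` (the table name, timestamp, snapshot ID, and branch name below are placeholders, not values from this PR):

```scala
// Assumes an existing SparkSession `spark` configured with an Iceberg catalog `prod`.

// Read the table as of a wall-clock timestamp (placeholder value):
spark.sql("SELECT * FROM prod.db.table TIMESTAMP AS OF '2024-01-01 00:00:00'")

// Read a specific snapshot by its long snapshot ID (placeholder value):
spark.sql("SELECT * FROM prod.db.table VERSION AS OF 10963874102873")

// VERSION AS OF also accepts a string branch or tag name (placeholder value):
spark.sql("SELECT * FROM prod.db.table VERSION AS OF 'audit-branch'")
```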
|
Comment on lines
-183
to
-186
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: i think we should keep this warning |
````diff
 ### Incremental read
 
 To read appended data incrementally, use:
 
````
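The incremental-read context above refers to Iceberg's snapshot-range read options; a hedged sketch, assuming a running SparkSession `spark` (the path and snapshot IDs are placeholders):

```scala
// Assumes an existing SparkSession `spark`; path and snapshot IDs are placeholders.
val df = spark.read
  .format("iceberg")
  .option("start-snapshot-id", "10963874102873") // exclusive lower bound
  .option("end-snapshot-id", "63874143573109")   // inclusive upper bound
  .load("path/to/table")
```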
````diff
@@ -76,8 +76,6 @@ data.writeStream
   .toTable("database.table_name")
 ```
 
-If you're using Spark 3.0 or earlier, you need to use `.option("path", "database.table_name").start()`, instead of `.toTable("database.table_name")`.
-
````
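The `.toTable(...)` call kept as context in this hunk is the streaming-sink API introduced in Spark 3.1; a minimal sketch, assuming a streaming DataFrame `data` already exists (checkpoint path and table name are placeholders):

```scala
import org.apache.spark.sql.streaming.Trigger

// `data` is assumed to be an existing streaming DataFrame; names and paths are placeholders.
val query = data.writeStream
  .format("iceberg")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .option("checkpointLocation", "/tmp/checkpoints/table_name")
  .toTable("database.table_name")
```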
|
Comment on lines
-79
to
-80
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: we should keep this warning in case someone is still using spark 3.0 or earlier |
````diff
 In the case of the directory-based Hadoop catalog:
 
 ```scala
@@ -101,7 +99,7 @@ Iceberg doesn't support experimental [continuous processing](https://spark.apach
 
 ### Partitioned table
 
-Iceberg requires sorting data by partition per task prior to writing the data. In Spark tasks are split by Spark partition.
+Iceberg requires sorting data by partition per task prior to writing the data. In Spark tasks are split by Spark partition
 against partitioned table. For batch queries you're encouraged to do explicit sort to fulfill the requirement
 (see [here](spark-writes.md#writing-distribution-modes)), but the approach would bring additional latency as
 repartition and sort are considered as heavy operations for streaming workload. To avoid additional latency, you can
````
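The latency trade-off described in the paragraph above is usually resolved in Iceberg by switching the streaming writer to fanout mode; a hedged sketch using the `fanout-enabled` write option from the Iceberg Spark docs (the streaming DataFrame `data`, checkpoint path, and table name are placeholders):

```scala
// Fanout mode keeps one open file per seen partition, so incoming rows need not be
// pre-sorted by partition; this trades extra writer memory for lower streaming latency.
data.writeStream
  .format("iceberg")
  .outputMode("append")
  .option("fanout-enabled", "true")
  .option("checkpointLocation", "/tmp/checkpoints/table_name")
  .toTable("database.table_name")
```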
````diff
@@ -274,7 +274,7 @@ This configuration creates a path-based catalog named `local` for tables under `
 === "CLI"
 
     ```sh
-    spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }}\
+    spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-{{ sparkVersionMajor }}:{{ icebergVersion }}\
         --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
         --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
         --conf spark.sql.catalog.spark_catalog.type=hive \
@@ -287,7 +287,7 @@ This configuration creates a path-based catalog named `local` for tables under `
 === "spark-defaults.conf"
 
     ```sh
-    spark.jars.packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }}
+    spark.jars.packages org.apache.iceberg:iceberg-spark-runtime-{{ sparkVersionMajor }}:{{ icebergVersion }}
     spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
     spark.sql.catalog.spark_catalog org.apache.iceberg.spark.SparkSessionCatalog
     spark.sql.catalog.spark_catalog.type hive
@@ -309,27 +309,27 @@ If you already have a Spark environment, you can add Iceberg, using the `--packa
 === "SparkSQL"
 
     ```sh
-    spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }}
+    spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-{{ sparkVersionMajor }}:{{ icebergVersion }}
     ```
 
 === "Spark-Shell"
 
     ```sh
-    spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }}
+    spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-{{ sparkVersionMajor }}:{{ icebergVersion }}
     ```
 
 === "PySpark"
 
     ```sh
-    pyspark --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }}
+    pyspark --packages org.apache.iceberg:iceberg-spark-runtime-{{ sparkVersionMajor }}:{{ icebergVersion }}
    ```
 
 !!! note
     If you want to include Iceberg in your Spark installation, add the Iceberg Spark runtime to Spark's `jars` folder.
     You can download the runtime by visiting to the [Releases](releases.md) page.
 
 <!-- markdown-link-check-disable-next-line -->
-[spark-runtime-jar]: https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/{{ icebergVersion }}/iceberg-spark-runtime-3.5_2.12-{{ icebergVersion }}.jar
+[spark-runtime-jar]: https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-{{ sparkVersionMajor }}/{{ icebergVersion }}/iceberg-spark-runtime-{{ sparkVersionMajor }}-{{ icebergVersion }}.jar
````
> **Contributor** (comment on the `[spark-runtime-jar]` link line): nit: this isnt rendered
|
|
||
| #### Learn More | ||
|
|
||
|
|
||
> **Reviewer:** Please share a snapshot of this page after the PR.