
[FEATURE REQUEST] Quick Start: Spark via docker #110

Open

kevinjqliu opened this issue Aug 7, 2024 · 6 comments · May be fixed by #295
Labels: enhancement (New feature or request)

Comments

@kevinjqliu
Contributor

kevinjqliu commented Aug 7, 2024

Is your feature request related to a problem? Please describe.

Not a problem. This is an enhancement to the "Quick Start" guide.

Describe the solution you'd like

Include a Spark Docker container instead of cloning the Spark GitHub repo, as the "Quick Start" guide currently describes.

Describe alternatives you've considered

Possibly use PySpark or other engines (Trino, PyIceberg, etc.).

Additional context

Modelled after the Spark and Iceberg Quickstart guide, which is defined in https://github.com/tabular-io/docker-spark-iceberg/tree/main
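
For illustration, the clone-and-build step could be replaced by something along these lines (the image name comes from that repo; the exposed port is an assumption):

```bash
# Sketch only: run the Spark + Iceberg quickstart image from the
# tabular-io/docker-spark-iceberg repo instead of building Spark locally.
# Port 8888 (Jupyter) is an assumption for illustration.
docker run -it --rm -p 8888:8888 tabulario/spark-iceberg
```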

@kevinjqliu added the enhancement label on Aug 7, 2024
@collado-mike
Contributor

There is a Jupyter notebook example using docker-compose at https://github.com/apache/polaris/blob/main/docker-compose-jupyter.yml. It's not in the quickstart guide, though.
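
For anyone who wants to try it, bringing that environment up should just be (assuming it is run from the repo root):

```bash
# Start the Jupyter notebook example referenced above.
docker compose -f docker-compose-jupyter.yml up
```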

@kevinjqliu
Contributor Author

@collado-mike wdyt about moving that to the getting-started/ folder as part of #42?
Alternatively, we could create a new example with just Spark.

@kevinjqliu linked pull request #295 on Sep 15, 2024 that will close this issue
@kevinjqliu
Contributor Author

Possibly include #192 as a setup script.

@flyrain
Contributor

flyrain commented Sep 18, 2024

Besides the Docker option, ./regtest/run_spark_sql.sh is an even faster way to connect to Polaris with Spark. Here is the usage:

```bash
# Usage:
#   ./run_spark_sql.sh [S3-location AWS-IAM-role]
#
# Description:
#   - Without arguments: Runs against a catalog backed by the local filesystem.
#   - With two arguments: Runs against a catalog backed by AWS S3.
#       - [S3-location]  - The S3 path to use as the default base location for the catalog.
#       - [AWS-IAM-role] - The AWS IAM role for the catalog to assume when accessing the S3 location.
#
# Examples:
#   - Run against local filesystem:
#     ./run_spark_sql.sh
#
#   - Run against AWS S3:
#     ./run_spark_sql.sh s3://my-bucket/path arn:aws:iam::123456789001:role/my-role
```
@kevinjqliu
Contributor Author

run_spark_sql.sh requires a running Polaris service, right? I want to include it as part of the "getting-started" guide for Spark.
One option is Jupyter; the other is this script, which spawns a Spark SQL shell.
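
For context, connecting that Spark SQL shell to Polaris boils down to pointing spark-sql at the Iceberg REST catalog, roughly like this (the versions, port, catalog name, and credentials below are illustrative assumptions, not taken from the script):

```bash
# Rough sketch: launch a spark-sql shell against a running Polaris instance
# via the Iceberg REST catalog. All concrete values here are assumptions.
spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2 \
  --conf spark.sql.catalog.polaris=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.polaris.type=rest \
  --conf spark.sql.catalog.polaris.uri=http://localhost:8181/api/catalog \
  --conf spark.sql.catalog.polaris.credential=<client-id>:<client-secret> \
  --conf spark.sql.catalog.polaris.warehouse=quickstart_catalog
```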

@flyrain
Contributor

flyrain commented Sep 18, 2024

> run_spark_sql.sh requires a running Polaris service, right?

Yes. Feel free to add it to the doc, and thank you for doing this!
