
[FEATURE REQUEST] Quick Start: Spark via docker #110

Open

kevinjqliu opened this issue Aug 7, 2024 · 6 comments · May be fixed by #295
Labels: enhancement (New feature or request)

Comments

@kevinjqliu
Contributor

kevinjqliu commented Aug 7, 2024

Is your feature request related to a problem? Please describe.

Not a problem. This is an enhancement to the "Quick Start" guide.

Describe the solution you'd like

Include a Spark Docker container instead of cloning the Spark GitHub repo, as the "Quick Start" guide currently describes.

Describe alternatives you've considered

Possibly use PySpark or other engines (Trino, PyIceberg, etc.).

Additional context

Modelled after the Spark and Iceberg Quickstart guide, which is defined in https://github.com/tabular-io/docker-spark-iceberg/tree/main
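
For illustration, the clone-and-build step could be replaced by something along these lines (the image name comes from that repo; the exposed port is an assumption):

```bash
# Sketch only: run the Spark + Iceberg quickstart image from the
# tabular-io/docker-spark-iceberg repo instead of building Spark locally.
# Port 8888 (Jupyter) is an assumption for illustration.
docker run -it --rm -p 8888:8888 tabulario/spark-iceberg
```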

@kevinjqliu added the enhancement label on Aug 7, 2024
@collado-mike
Contributor

There is a Jupyter notebook example using docker-compose at https://github.com/apache/polaris/blob/main/docker-compose-jupyter.yml. It's not in the quickstart guide, though.
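
For anyone who wants to try it, bringing that environment up should just be (assuming it is run from the repo root):

```bash
# Start the Jupyter notebook example referenced above.
docker compose -f docker-compose-jupyter.yml up
```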

@kevinjqliu
Contributor Author

@collado-mike wdyt about moving that to the getting-started/ folder as part of #42?
Alternatively, we could create a new example with just Spark.

@kevinjqliu linked pull request #295 on Sep 15, 2024 that will close this issue
@kevinjqliu
Contributor Author

Possibly include #192 as a setup script.

@flyrain
Contributor

flyrain commented Sep 18, 2024

Besides the Docker option, ./regtest/run_spark_sql.sh is an even faster way to connect to Polaris with Spark. Here is the usage:

```bash
# Usage:
#   ./run_spark_sql.sh [S3-location AWS-IAM-role]
#
# Description:
#   - Without arguments: Runs against a catalog backed by the local filesystem.
#   - With two arguments: Runs against a catalog backed by AWS S3.
#       - [S3-location]  - The S3 path to use as the default base location for the catalog.
#       - [AWS-IAM-role] - The AWS IAM role for the catalog to assume when accessing the S3 location.
#
# Examples:
#   - Run against local filesystem:
#     ./run_spark_sql.sh
#
#   - Run against AWS S3:
#     ./run_spark_sql.sh s3://my-bucket/path arn:aws:iam::123456789001:role/my-role
```
@kevinjqliu
Contributor Author

run_spark_sql.sh requires a running Polaris service, right? I want to include it as part of the "getting-started" guide for Spark.
One option is Jupyter; the other is this script, which spawns a Spark SQL shell.
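
For context, connecting that Spark SQL shell to Polaris boils down to pointing spark-sql at the Iceberg REST catalog, roughly like this (the versions, port, catalog name, and credentials below are illustrative assumptions, not taken from the script):

```bash
# Rough sketch: launch a spark-sql shell against a running Polaris instance
# via the Iceberg REST catalog. All concrete values here are assumptions.
spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2 \
  --conf spark.sql.catalog.polaris=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.polaris.type=rest \
  --conf spark.sql.catalog.polaris.uri=http://localhost:8181/api/catalog \
  --conf spark.sql.catalog.polaris.credential=<client-id>:<client-secret> \
  --conf spark.sql.catalog.polaris.warehouse=quickstart_catalog
```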

@flyrain
Contributor

flyrain commented Sep 18, 2024

> run_spark_sql.sh requires a running Polaris service, right?

Yes. Feel free to add it to the doc, and thank you for doing this!
