[FEATURE] Flint/PPL Tutorial based End2End sample #1010

Open · YANG-DB opened this issue Jan 8, 2025 · 0 comments
Labels: enhancement (New feature or request), infrastructure (Changes to infrastructure, testing, CI/CD, pipelines, etc.), testing (test related feature)

Is your feature request related to a problem?
As part of the effort to educate the community and users on how to use Flint, PPL, and their functionality, we would like to introduce a mechanism (framework) for setting up a simple tutorial-based experience that helps users explore and experiment with Flint, the Flint API, PPL, queries, and more.

Containerized Testing Framework

Spark

This guide will get you up and running with OpenSearch Flint using Apache Spark / EMR, including sample code to highlight some powerful features.

We will use docker-compose to spin up an end-to-end running sample containing:

  • Spark / EMR with Flint's deployed job
  • OpenSearch server container
  • OpenSearch Dashboards container
  • S3-compatible (Minio) object store container

The Spark container is configured with both the Flint and PPL extensions, enabling it to execute PPL queries and to query indices on the OpenSearch server.

  spark:
    image: bitnami/spark:${SPARK_VERSION:-3.5.3}
    container_name: spark
    ports:
      - "${MASTER_UI_PORT:-8080}:8080"
      - "${MASTER_PORT:-7077}:7077"
      - "${UI_PORT:-4040}:4040"
      - "${SPARK_CONNECT_PORT}:15002"
    entrypoint: /opt/bitnami/scripts/spark/master-entrypoint.sh
    user: root
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_PUBLIC_DNS=localhost
      - AWS_ENDPOINT_URL_S3=http://minio-S3
      - OPENSEARCH_ADMIN_PASSWORD=${OPENSEARCH_ADMIN_PASSWORD}
    volumes:
      - type: bind
        source: ./spark-master-entrypoint.sh
        target: /opt/bitnami/scripts/spark/master-entrypoint.sh
      - type: bind
        source: ./spark-defaults.conf
        target: /opt/bitnami/spark/conf/spark-defaults.conf
      - type: bind
        source: ./log4j2.properties
        target: /opt/bitnami/spark/conf/log4j2.properties
      - type: bind
        source: $PPL_JAR
        target: /opt/bitnami/spark/jars/ppl-spark-integration.jar
      - type: bind
        source: $FLINT_JAR
        target: /opt/bitnami/spark/jars/flint-spark-integration.jar
      - type: bind
        source: ./s3.credentials
        target: /opt/bitnami/spark/s3.credentials
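To bring the stack up, the environment variables referenced above need to point at the locally built extension jars and carry the desired admin password. A minimal sketch, assuming the assembly jars were built locally (the paths and values below are placeholders, not part of the actual setup):

# Placeholder paths -- point these at the locally built assembly jars (assumption).
export PPL_JAR=./ppl-spark-integration/target/scala-2.12/ppl-spark-integration-assembly.jar
export FLINT_JAR=./flint-spark-integration/target/scala-2.12/flint-spark-integration-assembly.jar
export OPENSEARCH_ADMIN_PASSWORD=C0rrecthorsebatterystaple.
export SPARK_CONNECT_PORT=15002

# Start the full stack defined in docker-compose.yml
docker compose up -d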

The OpenSearch Dashboards container is configured to connect to the OpenSearch server container.

The Spark container is started up as a driver and runs the Spark application.

Spark uses Minio as an S3-compliant object store, allowing Flint to query long-term storage locally.
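The compose excerpt above shows only the Spark service. A minimal sketch of the remaining services could look like the following; the image tags, ports, and credential variables are assumptions for illustration, not taken from the actual compose file (only the minio-S3 service name is dictated by the Spark configuration, which points at http://minio-S3):

  opensearch:
    image: opensearchproject/opensearch:${OPENSEARCH_VERSION:-2.17.0}
    container_name: opensearch
    environment:
      - discovery.type=single-node
      - OPENSEARCH_INITIAL_ADMIN_PASSWORD=${OPENSEARCH_ADMIN_PASSWORD}
    ports:
      - "9200:9200"

  opensearch-dashboards:
    image: opensearchproject/opensearch-dashboards:${OPENSEARCH_VERSION:-2.17.0}
    container_name: opensearch-dashboards
    environment:
      - 'OPENSEARCH_HOSTS=["https://opensearch:9200"]'
    ports:
      - "5601:5601"

  minio-S3:
    image: minio/minio
    container_name: minio-S3
    command: server /data --console-address ":9001"
    environment:
      # These credentials must line up with the fs.s3a access/secret keys in spark-defaults.conf.
      - MINIO_ROOT_USER=${S3_ACCESS_KEY:-minioadmin}
      - MINIO_ROOT_PASSWORD=${S3_SECRET_KEY:-minioadmin}
    ports:
      - "9000:9000"
      - "9001:9001"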

spark.datasource.flint.auth           basic
spark.datasource.flint.auth.username  admin
spark.datasource.flint.auth.password  C0rrecthorsebatterystaple.
spark.sql.warehouse.dir               s3a://integ-test/
spark.hadoop.fs.s3a.impl              org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.path.style.access true
spark.hadoop.fs.s3a.access.key        Vt7jnvi5BICr1rkfsheT
spark.hadoop.fs.s3a.secret.key        5NK3StGvoGCLUWvbaGN0LBUf9N6sjE94PEzLdqwO
spark.hadoop.fs.s3a.endpoint          minio-S3:9000
spark.hadoop.fs.s3a.connection.ssl.enabled false
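The excerpt above omits how the two extensions and the OpenSearch endpoint are registered. In a typical setup, spark-defaults.conf would also carry lines like the following; the extension class names follow the opensearch-spark documentation and the host/port values are assumptions matching this compose topology, so verify them against the version in use:

spark.sql.extensions                  org.opensearch.flint.spark.FlintPPLSparkExtensions,org.opensearch.flint.spark.FlintSparkExtensions
spark.datasource.flint.host           opensearch
spark.datasource.flint.port           9200
spark.datasource.flint.scheme         http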

Jupyter Notebook based tutorial

The following Dockerfile adds support for the Jupyter notebook server and the tutorial folder library:

FROM python:3.10-bullseye

RUN apt-get update && \
    apt-get install -y --no-install-recommends \
      sudo \
      curl \
      vim \
      unzip \
      openjdk-11-jdk \
      build-essential \
      software-properties-common \
      ssh && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip3 install -r requirements.txt

RUN python3 -m spylon_kernel install

RUN curl https://github.com/SpencerPark/IJava/releases/download/v1.3.0/ijava-1.3.0.zip -Lo ijava-1.3.0.zip \
  && unzip ijava-1.3.0.zip \
  && python3 install.py --sys-prefix \
  && rm ijava-1.3.0.zip

# Optional env variables
ENV SPARK_HOME=${SPARK_HOME:-"/opt/spark"}
ENV PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9.7-src.zip:$PYTHONPATH

WORKDIR ${SPARK_HOME}

ENV SPARK_VERSION=3.5.2
ENV SPARK_MAJOR_VERSION=3.5
ENV ICEBERG_VERSION=1.6.0

# Download spark
RUN mkdir -p ${SPARK_HOME} \
 && curl https://dlcdn.apache.org/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.tgz -o spark-${SPARK_VERSION}-bin-hadoop3.tgz \
 && tar xvzf spark-${SPARK_VERSION}-bin-hadoop3.tgz --directory /opt/spark --strip-components 1 \
 && rm -rf spark-${SPARK_VERSION}-bin-hadoop3.tgz

# Add spark runtime jar to IJava classpath
ENV IJAVA_CLASSPATH=/opt/spark/jars/*

# Download the tutorial datasets (NYC film permits and yellow taxi trip data)
RUN mkdir -p /home/demo/data \
 && curl https://data.cityofnewyork.us/resource/tg4x-b46p.json -o /home/demo/data/nyc_film_permits.json \
 && curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-04.parquet -o /home/demo/data/yellow_tripdata_2022-04.parquet \
 && curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-03.parquet -o /home/demo/data/yellow_tripdata_2022-03.parquet \
 && curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-02.parquet -o /home/demo/data/yellow_tripdata_2022-02.parquet \
 && curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet -o /home/demo/data/yellow_tripdata_2022-01.parquet \
 && curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-12.parquet -o /home/demo/data/yellow_tripdata_2021-12.parquet \
 && curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-11.parquet -o /home/demo/data/yellow_tripdata_2021-11.parquet \
 && curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-10.parquet -o /home/demo/data/yellow_tripdata_2021-10.parquet \
 && curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-09.parquet -o /home/demo/data/yellow_tripdata_2021-09.parquet \
 && curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-08.parquet -o /home/demo/data/yellow_tripdata_2021-08.parquet \
 && curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-07.parquet -o /home/demo/data/yellow_tripdata_2021-07.parquet \
 && curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-06.parquet -o /home/demo/data/yellow_tripdata_2021-06.parquet \
 && curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-05.parquet -o /home/demo/data/yellow_tripdata_2021-05.parquet \
 && curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-04.parquet -o /home/demo/data/yellow_tripdata_2021-04.parquet

RUN mkdir -p /home/demo/localwarehouse /home/demo/notebooks /home/demo/warehouse /home/demo/spark-events /home/demo
COPY notebooks/ /home/demo/notebooks

# Add a notebook command
RUN echo '#! /bin/sh' >> /bin/notebook \
 && echo 'export PYSPARK_DRIVER_PYTHON=jupyter-notebook' >> /bin/notebook \
 && echo "export PYSPARK_DRIVER_PYTHON_OPTS=\"--notebook-dir=/home/demo/notebooks --ip='*' --NotebookApp.token='' --NotebookApp.password='' --port=8888 --no-browser --allow-root\"" >> /bin/notebook \
 && echo "pyspark" >> /bin/notebook \
 && chmod u+x /bin/notebook

# Add a pyspark-notebook command (alias for notebook command for backwards-compatibility)
RUN echo '#! /bin/sh' >> /bin/pyspark-notebook \
 && echo 'export PYSPARK_DRIVER_PYTHON=jupyter-notebook' >> /bin/pyspark-notebook \
 && echo "export PYSPARK_DRIVER_PYTHON_OPTS=\"--notebook-dir=/home/demo/notebooks --ip='*' --NotebookApp.token='' --NotebookApp.password='' --port=8888 --no-browser --allow-root\"" >> /bin/pyspark-notebook \
 && echo "pyspark" >> /bin/pyspark-notebook \
 && chmod u+x /bin/pyspark-notebook

COPY spark-defaults.conf /opt/spark/conf
ENV PATH="/opt/spark/sbin:/opt/spark/bin:${PATH}"

RUN chmod u+x /opt/spark/sbin/* && \
    chmod u+x /opt/spark/bin/*

COPY .pyiceberg.yaml /root/.pyiceberg.yaml

COPY entrypoint.sh .

ENTRYPOINT ["./entrypoint.sh"]
CMD ["notebook"]

The /home/demo/data mapped volume would contain the Python Jupyter notebook tutorials for getting started with Flint / PPL using Spark:

  • An Introduction to the Flint API.ipynb
  • PPL Getting Started.ipynb
  • PPL Data Projections.ipynb
  • SQL Data Accelerations.ipynb

NYC Taxi Dataset

The NYC Taxi Dataset provides a rich source of real-world data for experimentation with Flint, PPL, and Spark. It contains yellow taxi trip records: pickup and drop-off times, locations, trip distances, fare amounts, and other relevant metadata.
It is used to demonstrate Flint's querying, data indexing, and analytics capabilities for both SQL and PPL.

Data Setup

The NYC Taxi Dataset is included in the Docker setup as .parquet files located in the /home/demo/data directory of the container. Each file corresponds to a specific month and year, enabling experimentation with partitioned data and time-series queries.

The .parquet files are preloaded for the following months:

  • 2021: April to December
  • 2022: January to April

These files can be accessed from Spark or directly via Minio (S3-compatible object storage), as sketched below.
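For example, a notebook cell along the following lines could load one of the preloaded months and run a quick sanity check. This is a minimal sketch: the column names follow the public TLC trip record schema and are assumptions about the files, and in the tutorial notebooks a SparkSession is typically already available as spark.

from pyspark.sql import SparkSession

# Build a session explicitly so the snippet is self-contained.
spark = SparkSession.builder.appName("nyc-taxi-sanity-check").getOrCreate()

# Read one preloaded month from the container's data directory.
trips = spark.read.parquet("/home/demo/data/yellow_tripdata_2022-01.parquet")

trips.printSchema()

# Average fare per passenger count (column names assume the public TLC schema).
(trips.groupBy("passenger_count")
      .avg("fare_amount")
      .orderBy("passenger_count")
      .show())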

Tutorials Featuring NYC Taxi Dataset

The dataset is used as the basis for hands-on tutorials available in the /home/demo/notebooks folder:

  • An Introduction to the Flint API.ipynb: Learn how to query and manipulate data.
  • PPL Getting Started.ipynb: Explore Flint's PPL capabilities with real-world data (see the sketch after this list).
  • PPL Data Projections.ipynb: Project and filter key metrics from the dataset.
  • SQL Data Accelerations.ipynb: Accelerate data processing with OpenSearch indices using Flint optimizations.
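For instance, the PPL Getting Started notebook might open with a cell like the one below. With the PPL extension on the Spark classpath, PPL statements can be submitted through spark.sql; the table name is illustrative and the exact command coverage depends on the PPL extension version, so treat this as a sketch rather than the notebook's actual content.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ppl-getting-started").getOrCreate()

# Register the taxi data as a table so it can be targeted from PPL (name is illustrative).
trips = spark.read.parquet("/home/demo/data/yellow_tripdata_2022-01.parquet")
trips.createOrReplaceTempView("nyc_taxi")

# With the PPL extension loaded, spark.sql also accepts PPL statements.
spark.sql(
    "source = nyc_taxi | where fare_amount > 0 "
    "| stats avg(trip_distance) as avg_distance by payment_type "
    "| sort - avg_distance"
).show()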


General purpose testing facilities

To enhance flexibility and support a wide range of use cases, the Docker setup includes a general-purpose data folder located at /home/demo/data.
This folder is designed to house datasets and accompanying resources tailored for specific tutorials and learning scenarios. Each dataset resides in its own subfolder, containing:

Dataset Files: The raw or preprocessed data required for the tutorial, such as .parquet, .csv, or .json files.

Loading Script: A Jupyter Notebook (load_dataset.ipynb) that demonstrates how to load and prepare the dataset using Spark or other tools (a minimal sketch follows after this list).

Tutorial-Specific Notebooks: A collection of Jupyter Notebooks designed to guide users through specific functionalities and use cases related to Flint, PPL, or Spark.

These notebooks provide step-by-step instructions for tasks such as querying, data transformation, and visualization.
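A minimal sketch of what a load_dataset.ipynb cell could do for such a subfolder: read the raw files and persist them as a managed table in the S3A-backed warehouse (spark.sql.warehouse.dir = s3a://integ-test/ above), so every tutorial notebook starts from the same table. The paths and table name are assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("load-nyc-taxi").getOrCreate()

# Read every preloaded month for this dataset at once.
raw = spark.read.parquet("/home/demo/data/nyc_taxi/*.parquet")

# Persist into the warehouse configured in spark-defaults.conf, so the table is
# served from Minio rather than the container's local filesystem.
raw.write.mode("overwrite").saveAsTable("nyc_taxi")

spark.sql("SHOW TABLES").show()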

Example Structure
For the NYC Taxi Dataset, the folder structure would look like this:

/home/demo/data/nyc_taxi/
  ├── yellow_tripdata_2021-12.parquet
  ├── yellow_tripdata_2022-01.parquet
  ├── load_dataset.ipynb
  ├── An_Introduction_to_the_Flint_API.ipynb
  ├── PPL_Getting_Started.ipynb
  ├── PPL_Data_Projections.ipynb
  └── SQL_Data_Accelerations.ipynb

Do you have any additional context?
