[FEATURE]Flint/PPL Tutorial based End2End sample #1010
Labels
- enhancement: New feature or request
- infrastructure: Changes to infrastructure, testing, CI/CD, pipelines, etc.
- testing: test related feature
Is your feature request related to a problem?
To educate the community and users on how to use Flint, PPL, and their functionality, we would like to introduce a mechanism (framework) for setting up a simple tutorial-based experience that helps users explore and experiment with Flint, the Flint API, PPL, queries, and more.
Containerized Testing Framework
Spark
This guide will get you up and running with OpenSearch Flint using Apache Spark / EMR, including sample code to highlight some powerful features.
We will use docker-compose to generate an End2End running sample containing the following containers:
- Spark
- OpenSearch server
- OpenSearch Dashboards
- Minio
The Spark container is configured with both the Flint and PPL extensions, enabling it to both execute PPL queries and query indices on the OpenSearch server.
The OpenSearch Dashboards container is configured to connect to the OpenSearch server container.
The Spark container is started up as a driver and runs the Spark application.
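As a rough illustration of what such a Spark configuration could look like, the sketch below collects the relevant settings in a plain dictionary and applies them to a session builder. The host names, ports, and credentials are placeholder assumptions, not values taken from this issue; the extension class names and the `spark.datasource.flint.*` and `spark.hadoop.fs.s3a.*` keys should be checked against the Flint and Hadoop S3A documentation for the versions in use.

```python
def build_spark_conf(opensearch_host="opensearch", opensearch_port=9200,
                     minio_endpoint="http://minio:9000",
                     access_key="CHANGE_ME", secret_key="CHANGE_ME"):
    """Return Spark settings for Flint, PPL, and an S3-compatible store.

    All default values are hypothetical placeholders for a local
    docker-compose setup; adjust them to the actual service names.
    """
    return {
        # Enable both the PPL and the Flint SQL extensions.
        "spark.sql.extensions": ",".join([
            "org.opensearch.flint.spark.FlintPPLSparkExtensions",
            "org.opensearch.flint.spark.FlintSparkExtensions",
        ]),
        # Where Flint finds the OpenSearch server.
        "spark.datasource.flint.host": opensearch_host,
        "spark.datasource.flint.port": str(opensearch_port),
        "spark.datasource.flint.scheme": "http",
        # Point the S3A filesystem at the Minio container.
        "spark.hadoop.fs.s3a.endpoint": minio_endpoint,
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.connection.ssl.enabled": "false",
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
    }


def create_session(conf, app_name="flint-tutorial"):
    """Build a SparkSession from the settings (requires pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily on purpose

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Keeping the settings in a dictionary makes them easy to inspect from a notebook cell before a session is created, for example with `build_spark_conf()["spark.sql.extensions"]`.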
Spark uses Minio as an S3-compliant object store, allowing Flint to query long-term storage locally.
Jupyter Notebook based tutorial
A Dockerfile adds support for Jupyter notebooks and the tutorial folder library.
The /home/demo/data mapped volume contains the Python Jupyter notebook tutorials for getting started with Flint / PPL using Spark:
- An Introduction to the Flint API.ipynb
- PPL Getting Started.ipynb
- PPL Data Projections.ipynb
- SQL Data Accelerations.ipynb
NYC Taxi Dataset
The NYC Taxi Dataset provides a rich source of real-world data for experimentation with Flint, PPL, and Spark. It contains yellow taxi trip records with pickup and drop-off times, locations, trip distances, fare amounts, and other relevant metadata.
This dataset is used to demonstrate Flint's capabilities in querying, data indexing, and analytics for both SQL and PPL.
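To make the SQL and PPL comparison concrete, here is a minimal sketch that expresses the same aggregation in both languages. The view name `nyc_taxi` and the choice of query are illustrative assumptions; the column names follow the public yellow taxi schema, and the `run` helper assumes a Spark session with the PPL extension enabled so that `spark.sql` accepts PPL text.

```python
TABLE = "nyc_taxi"  # hypothetical temp view name used for illustration


def sql_avg_fare(table=TABLE):
    """Average fare per passenger count for trips over one mile, in SQL."""
    return (
        f"SELECT passenger_count, AVG(fare_amount) AS avg_fare "
        f"FROM {table} WHERE trip_distance > 1 GROUP BY passenger_count"
    )


def ppl_avg_fare(table=TABLE):
    """The same aggregation written as a PPL pipeline."""
    return (
        f"source = {table} "
        f"| where trip_distance > 1 "
        f"| stats avg(fare_amount) as avg_fare by passenger_count"
    )


def run(spark, parquet_path):
    """Load one parquet file and run both queries.

    Assumes `spark` is a session with the PPL extension loaded and that
    `parquet_path` points at one of the preloaded monthly files.
    """
    spark.read.parquet(parquet_path).createOrReplaceTempView(TABLE)
    return spark.sql(sql_avg_fare()), spark.sql(ppl_avg_fare())
```

Both result DataFrames should describe the same aggregation, which makes this pair a convenient first exercise for a tutorial notebook.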
Data Setup
The NYC Taxi Dataset is included in the Docker setup as .parquet files located in the /home/demo/data directory of the container. Each file corresponds to a specific month and year, enabling experimentation with partitioned data and time-series queries.
The .parquet files are preloaded for the following months:
These files can be accessed from Spark or directly via Minio (S3-like object storage).
Tutorials Featuring NYC Taxi Dataset
The dataset is used as the basis for hands-on tutorials available in the /home/demo/notebooks folder:
- An Introduction to the Flint API.ipynb: Learn how to query and manipulate data.
- PPL Getting Started.ipynb: Explore Flint's PPL capabilities with real-world data.
- PPL Data Projections.ipynb: Project and filter key metrics from the dataset.
- SQL Data Accelerations.ipynb: Accelerate data processing with OpenSearch indices using Flint optimizations.
General purpose testing facilities
To enhance flexibility and support a wide range of use cases, the Docker setup includes a general-purpose data folder located at /home/demo/data.
This folder is designed to house datasets and accompanying resources tailored for specific tutorials and learning scenarios. Each dataset resides in its own subfolder, containing:
- Dataset Files: The raw or preprocessed data required for the tutorial, such as .parquet, .csv, or .json files.
- Loading Script: A Jupyter Notebook (load_dataset.ipynb) that demonstrates how to load and prepare the dataset using Spark or other tools.
- Tutorial-Specific Notebooks: A collection of Jupyter Notebooks designed to guide users through specific functionalities and use cases related to Flint, PPL, or Spark. These notebooks provide step-by-step instructions for tasks such as querying, data transformation, and visualization.
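The per-dataset layout described above could be checked with a small helper like the following sketch. It assumes a flat layout inside each dataset subfolder (data files, load_dataset.ipynb, and tutorial notebooks side by side), which is an assumption made for illustration; the function names are hypothetical.

```python
from pathlib import Path

DATA_EXTENSIONS = {".parquet", ".csv", ".json"}
LOADER_NAME = "load_dataset.ipynb"


def describe_dataset(folder):
    """Summarize one dataset subfolder: data files, loader, tutorials."""
    files = [p for p in Path(folder).iterdir() if p.is_file()]
    return {
        "data_files": sorted(p.name for p in files
                             if p.suffix in DATA_EXTENSIONS),
        "has_loader": any(p.name == LOADER_NAME for p in files),
        # Every other notebook is treated as a tutorial notebook.
        "tutorials": sorted(p.name for p in files
                            if p.suffix == ".ipynb" and p.name != LOADER_NAME),
    }


def scan_data_root(root="/home/demo/data"):
    """Describe every dataset subfolder under the general-purpose data root."""
    return {d.name: describe_dataset(d)
            for d in sorted(Path(root).iterdir()) if d.is_dir()}
```

Running `scan_data_root()` inside the container would give a quick overview of which datasets are complete, for instance which ones still lack a loading notebook.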
Example Structure
For the NYC Taxi Dataset, the folder structure would look like this:
Do you have any additional context?