(RFC) Integration Tests on Apache Spark and Spark EMR #992
Comments
I did some more testing and found another approach that could work. Create a simple Spark app.
For the integration test to use this, a special SparkSession is created. Start the docker cluster before any tests are run. When
The integration tests would then be able to quickly switch between Apache Spark and Spark EMR. For Apache Spark, Spark Connect is used. For Spark EMR, the new SparkSession is used. One downside is that starting the docker container for each query adds overhead and latency. The container is reused, though, and not recreated for each query. |
@normanj-bitquill
This use case will contain:
Step1:
Step2: Query Endpoint and Syntax With OpenSearch security plugin disabled
|
As discussed offline: Glue is an AWS service. There are docker images of Spark that include the Glue libraries. These images are able to use catalogs from Glue by making remote calls to AWS Glue. I cannot find a docker image of an AWS Glue server. It may be possible to mock out the AWS Glue server, but another option is to use Hive and S3 (minio) instead. |
I have tested configuring Spark to use the Minio server for storing the data store. This works, but I doubt that it provides any extra value when running queries in Spark. Spark uses Hadoop to access S3. The only OpenSearch code that could be involved in this setup is the PPL extension. The Flint extension requires async query working on OpenSearch. I have looked further into adding an S3 datasource in OpenSearch. In the docker environment, I was able to add an S3 Glue datasource. When trying to run an async query, it fails when OpenSearch tries to make a call using the A possibly simple solution is to create a Jar file with replacements for the classes For our testing, we want Spark to have an OpenSearch catalog that uses async query on OpenSearch. |
A quick test of the above shows some promise. For my testing, I started with an Next steps:
|
thanks @normanj-bitquill - sounds like you have made progress!! I would even simplify this by adding an environment param for selecting the implementation of the I think you can add such code in the
Let me know what you think? |
@YANG-DB The only issue here is the time to get the change you suggested into an OpenSearch docker image that the
This is the correct way forward, but to keep moving in the near term, I'll use an altered |
This work has been merged in. There is some documentation in the repository: The real goal was to be able to test the Async API using only local resources. Normally Spark EMR is used in processing Async API queries. The At present, using Apache Spark containers with the PPL and Flint integration extensions is sufficient for running an async query. The Spark application is the When OpenSearch would have called AWS EMR to run a query on a Spark EMR container, it will instead make a docker call to start an Apache Spark container. Async API queries have sessions. When a new session is started, OpenSearch will start a new Apache Spark container. The container will continue processing queries for the session for 3 minutes and then shut down. If more queries for the session are received after that, a new container is started. It is recommended to reuse sessions as much as possible. The |
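The session behaviour described above can be illustrated with a small simulation. This is only a sketch, not the code that was merged: the class and method names are hypothetical, and it assumes the 3 minute window is measured from the moment the container starts. If the window is actually measured from the last query, the comparison would use the time of the most recent query instead.

```python
from dataclasses import dataclass
from typing import Dict, List

# "3 minutes" comes from the comment above; treating it as time since
# container start is an assumption made for this sketch.
CONTAINER_LIFETIME_SECONDS = 180.0


@dataclass
class Container:
    """Stand-in for one Apache Spark container (hypothetical type)."""
    session_id: str
    started_at: float


class SessionContainers:
    """Tracks which simulated container serves each async query session."""

    def __init__(self, lifetime: float = CONTAINER_LIFETIME_SECONDS) -> None:
        self.lifetime = lifetime
        self.active: Dict[str, Container] = {}
        self.started: List[Container] = []  # every container ever started

    def container_for(self, session_id: str, now: float) -> Container:
        """Return the container that handles a query arriving at time `now`."""
        current = self.active.get(session_id)
        expired = current is not None and now - current.started_at >= self.lifetime
        if current is None or expired:
            # New session, or the previous container has already shut down.
            current = Container(session_id, now)
            self.active[session_id] = current
            self.started.append(current)
        return current
```

Under these assumptions, two queries sent 60 seconds apart in the same session share one container, while a query sent 200 seconds after the first causes a second container to be started.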
@normanj-bitquill thanks for the detailed review ! |
Problem Overview
Work is underway to create the files needed for starting Docker clusters that can be used for integration testing. There will initially be two clusters, one for testing with Apache Spark and one for testing with Spark EMR.
The integration tests should be able to switch between the two clusters (and any future clusters) without any changes.
The integration tests will run from either SBT or a standalone script. This allows setting up CI steps for running the integration tests as well as running them locally.
An execution model is needed for the integration tests that works with both Apache Spark and Spark EMR.
Proposed Solution
Structure the tests as a set of queries. Each query will have an expected query plan, and expected results (if the query succeeds). These tests can be made available to the Spark container in a bound directory. There will be another bound directory for holding the test results and query plans.
The bound directories are:
- /tests/queries - Each query to run is in a separate file
- /tests/actual_query_plans
- /tests/actual_results
A Spark application is created that runs the integration tests. The application will look in the directory /tests/queries. For each query file that it finds, it will:
- /tests/actual_results
- /tests/actual_query_plans
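The loop described here might be sketched as follows. This is an illustration under stated assumptions, not the application from this RFC: the function name, the output file naming, the comma-separated result format, and the way failures are recorded are all hypothetical. The query executor is passed in as a callable so the sketch runs without a Spark installation; in a real Spark application it would wrap the session's SQL call.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Tuple

# Hypothetical executor signature: query text in, (plan text, result rows) out.
Executor = Callable[[str], Tuple[str, Iterable[tuple]]]


def run_queries(queries_dir, results_dir, plans_dir, execute: Executor) -> List[str]:
    """Run every query file and write its results and plan to the bound directories."""
    queries_dir, results_dir, plans_dir = (
        Path(queries_dir), Path(results_dir), Path(plans_dir))
    results_dir.mkdir(parents=True, exist_ok=True)
    plans_dir.mkdir(parents=True, exist_ok=True)

    processed = []
    for query_file in sorted(p for p in queries_dir.iterdir() if p.is_file()):
        query = query_file.read_text(encoding="utf-8").strip()
        processed.append(query_file.name)
        try:
            plan, rows = execute(query)
        except Exception as exc:
            # Assumption: a failed query records its error in place of results
            # and produces no plan file.
            (results_dir / query_file.name).write_text(f"ERROR: {exc}\n", encoding="utf-8")
            continue
        (plans_dir / query_file.name).write_text(plan + "\n", encoding="utf-8")
        lines = [",".join(str(value) for value in row) for row in rows]
        (results_dir / query_file.name).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return processed
```

Inside the container this could be called as `run_queries("/tests/queries", "/tests/actual_results", "/tests/actual_query_plans", execute)`, where `execute` is whatever function runs a single query against Spark.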
The SBT build is updated for the integration test phase to do the following:
This solution does not involve connecting remotely to the Spark container. Since Spark is only running a Spark application, the solution will work for both Apache Spark and Spark EMR.
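Because each query has an expected plan and expected results, whatever drives the tests (SBT or a standalone script) needs some way to compare the output directories against the expectations. A minimal sketch of such a check is below. The function names, the status strings, and the idea of a separate directory of expected files are assumptions for illustration; the RFC does not say where expected output is stored or how strictly plans should be matched.

```python
from pathlib import Path
from typing import Dict


def normalize(text: str) -> str:
    """Ignore trailing whitespace on each line and trailing blank lines."""
    return "\n".join(line.rstrip() for line in text.splitlines()).rstrip("\n")


def compare_dirs(expected_dir, actual_dir) -> Dict[str, str]:
    """Map each expected file name to 'match', 'mismatch', or 'missing'."""
    expected_dir, actual_dir = Path(expected_dir), Path(actual_dir)
    report = {}
    for expected_file in sorted(p for p in expected_dir.iterdir() if p.is_file()):
        actual_file = actual_dir / expected_file.name
        if not actual_file.is_file():
            report[expected_file.name] = "missing"
        elif normalize(actual_file.read_text()) == normalize(expected_file.read_text()):
            report[expected_file.name] = "match"
        else:
            report[expected_file.name] = "mismatch"
    return report
```

The same function can be run once for results and once for query plans. An exact text comparison is the simplest choice; plans that contain generated identifiers would likely need a looser match.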
Docker Clusters
Each Docker cluster will contain the following:
The Spark container is configured with both the Flint and PPL extensions, enabling it to both execute PPL queries and query indices on the OpenSearch server.
The OpenSearch Dashboards container is configured to connect to the OpenSearch server container.
The Spark container is started up as a driver and runs the Spark application.