
Implement FlintJob to handle all query types in warmpool mode #979

Open · wants to merge 2 commits into main

Conversation


@saranrajnk saranrajnk commented Dec 9, 2024

Description

This PR introduces support for FlintJob to handle all query types (interactive, streaming, and batch) with all data sources in warmpool mode. FlintJob also supports non-warmpool mode for streaming and batch queries, configurable via a Spark configuration setting.

FlintJob invokes Warmpool.scala, which in turn calls the client to continuously fetch queries for execution. The client sets various Spark configurations, such as the datasource, resultIndex, and other parameters. It also controls when to terminate the loop and stop the job. When a valid query is received, the JobOperator flow is triggered to execute the query and write the results accordingly.

Changes:

  • Introduces a new file, Warmpool.scala, which repeatedly calls getNextStatement() in a loop.
  • Adds support in JobOperator to write the query results either to QueryResultWriter or an OpenSearch Index, depending on the job type.
  • Implements the emission of success, failure, and latency metrics within JobOperator.
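The loop described above can be sketched roughly as follows. This is an illustrative outline only, not the PR's exact code: `WarmpoolClient`, `executeWithJobOperator`, and the shape of the termination signal are assumptions; only `getNextStatement()` is named in the PR description.

```scala
// Hypothetical sketch of the warmpool query loop (names are illustrative).
def queryLoop(client: WarmpoolClient): Unit = {
  var running = true
  while (running) {
    // The client continuously fetches queries and sets Spark configurations
    // such as the datasource and resultIndex.
    client.getNextStatement() match {
      case Some(query) =>
        // A valid query triggers the JobOperator flow, which executes it and
        // writes results via QueryResultWriter or to an OpenSearch index.
        executeWithJobOperator(query)
      case None =>
        // The client controls when to terminate the loop and stop the job.
        running = false
    }
  }
}
```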

Related Issues

Check List

  • Updated documentation (docs/ppl-lang/README.md)
  • Implemented unit tests
  • Implemented tests for combination with other commands
  • Newly added source code should include a copyright header
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@ykmr1224 (Collaborator) left a comment:

Can you clarify and document how WarmPool is abstracted and can be enabled/disabled?

Comment on lines 550 to 644

```scala
def getSegmentName(sparkSession: SparkSession): String = {
  val maxExecutorsCount =
    sparkSession.conf.get(FlintSparkConf.MAX_EXECUTORS_COUNT.key, "unknown")
  String.format("%se", maxExecutorsCount)
}
```
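As an aside, the formatting in `getSegmentName` reduces to suffixing the configured max executor count with "e". A minimal standalone illustration, with the Spark lookup factored out (the helper name is for illustration only):

```scala
// Illustration only: mirrors the String.format("%se", ...) call above.
// A count of "10" yields the segment name "10e"; when the configuration
// is unset, the "unknown" default yields "unknowne".
def segmentName(maxExecutorsCount: String): String =
  String.format("%se", maxExecutorsCount)
```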
Collaborator left a comment:

This segmentName is specific to warmpool logic; let us create abstractions on warmpool and record metrics via AOP.

@saranrajnk saranrajnk force-pushed the nexus-wp-feat branch 3 times, most recently from 044aeea to adef5b6 Compare December 20, 2024 20:43
@noCharger noCharger added the 0.7 label Jan 2, 2025
@noCharger (Collaborator) left a comment:

Can we remove the concept of interactive / batch / streaming job for warm pool?

```scala
  }
}

def queryLoop(commandContext: CommandContext): Unit = {
```
Collaborator left a comment:

Why do we need the concept of a query loop for warm pool?

Author (@saranrajnk) left a comment:

Warmpool also requires multiple iterations before running the actual query.


```scala
// osClient needs the Spark session to be created first so that FlintOptions
// is initialized; otherwise we get a connection exception from EMR-S to OpenSearch.
val osClient = new OSClient(FlintSparkConf().flintOptions())

// QueryResultWriter depends on sessionManager to fetch the sessionContext
val sessionManager = instantiateSessionManager(sparkSession, Some(resultIndex))
```
Author (@saranrajnk) left a comment:

Since JobOperator needs to support interactive queries, QueryResultWriter will be used. QueryResultWriterImpl, which handles writing query results, depends on sessionManager.

That's why sessionManager is introduced here: to satisfy this dependency for interactive queries.

Reference: https://github.com/opensearch-project/opensearch-spark/blob/main/spark-sql-application/src/main/scala/org/apache/spark/sql/QueryResultWriterImpl.scala#L20
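A rough sketch of the wiring this implies. Only `instantiateSessionManager` appears in the diff above; the `instantiateQueryResultWriter` helper and its parameters are assumptions made for illustration.

```scala
// Illustrative wiring, not the PR's exact code: for interactive queries,
// the SessionManager must exist before the QueryResultWriter, because
// QueryResultWriterImpl fetches the sessionContext through it.
val sessionManager = instantiateSessionManager(sparkSession, Some(resultIndex))
val queryResultWriter =
  instantiateQueryResultWriter(sparkSession, commandContext) // hypothetical helper
```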

Signed-off-by: Shri Saran Raj N <[email protected]>
4 participants