Add qualification support for Photon jobs in the Python Tool #1409
base: dev
Conversation
Signed-off-by: Partho Sarthi <[email protected]>
Thanks @parthosa!
Just for the sake of confirmation:
- Is there another follow-up PR to change the QualX module to read app_meta.json to decide whether an app is Photon or not? In that case the PR description is not accurate, because it gives the impression that this PR adds end-to-end support.
- I am concerned about how we can troubleshoot and validate app_meta.json. The wrapper reads the autotuner's output and copies some of the fields into that file at the upper level. With this PR, we are adding a new field derived from Python logic. Later, we will hit the question "Where does each field come from?" (this becomes even more challenging if fields might be overridden by the Python wrapper). CC: @tgravescs
if self.ctxt.platform.get_platform_name() not in {CspEnv.DATABRICKS_AWS, CspEnv.DATABRICKS_AZURE}:
    tools_processed_apps[exec_engine_col_name] = default_exec_engine_type
This could be a function in the platform; DB-AWS/DB-Azure can override it. Later, we might have some logic to apply for onprem as well.
For example, if a customer is running a custom Spark build (onprem), we will need a way to specify this "executionEngine". A sketch of the idea follows.
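A minimal sketch of this suggestion, assuming hypothetical class names (PlatformBase, DBAWSPlatform) and a simplified detection rule; the tool's actual platform classes may differ:

```python
class PlatformBase:
    """Base platform behavior: every app runs on plain Spark."""

    def get_execution_engine(self, spark_version: str) -> str:
        return 'spark'


class DBAWSPlatform(PlatformBase):
    """Databricks platforms override the default to detect Photon."""

    def get_execution_engine(self, spark_version: str) -> str:
        # Databricks encodes Photon in the runtime version string,
        # e.g. '13.3.x-photon-scala2.12'.
        return 'photon' if 'photon' in spark_version.lower() else 'spark'


print(DBAWSPlatform().get_execution_engine('13.3.x-photon-scala2.12'))  # -> photon
```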
:param file_name: Name of the metric file to read from each application's folder
"""
metrics = {}
root_metric_dir = self.ctxt.get_metrics_output_folder()
nit: Rather than using the legacy FSUtil, it might help to write this method using the storageLib.
That would ensure we do not have to revisit this method when we support distributed output.
I don't have a strong opinion on changing the implementation, though.
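For illustration only, a sketch of how the method might look against a storage abstraction; the `storage` object and its methods below are hypothetical placeholders, not the tool's actual storagelib API:

```python
import pandas as pd

def read_metric_files(storage, root_metric_dir: str, file_name: str) -> dict:
    """Collect one metric file per application folder into {app_id: DataFrame}.

    `storage` stands in for a storage-layer object that treats local and
    remote paths (s3://, abfss://, ...) uniformly, so the method would not
    need revisiting for distributed output.
    """
    metrics = {}
    for app_dir in storage.list_dirs(root_metric_dir):        # hypothetical call
        app_id = storage.base_name(app_dir)                   # hypothetical call
        with storage.open(f'{app_dir}/{file_name}') as fobj:  # hypothetical call
            metrics[app_id] = pd.read_csv(fobj)
    return metrics
```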
# that the dictionary contains entries for all apps to avoid KeyErrors
# and maintain consistency in processing.
metrics[app_id_str] = pd.DataFrame()
self.logger.warning('Unable to read metrics file for app %s. Reason - %s:%s',
nit: At some point we need to find a better way to log those warning messages. When I run the tools, there is a ton of messages triggered by missing profiler output files.
We could improve that in a separate issue: buffer all those missing files and just dump a single warning message.
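A minimal sketch of that buffering idea, with hypothetical names (missing_files, flush_missing_metric_warnings); it is not the tool's current logging code:

```python
import logging

logger = logging.getLogger('qualification')

missing_files = []  # buffer (app_id, reason) pairs instead of logging each one

def record_missing_metric(app_id: str, reason: str) -> None:
    missing_files.append((app_id, reason))

def flush_missing_metric_warnings() -> None:
    # Emit one aggregate warning at the end of processing instead of
    # one line per missing profiler output file.
    if missing_files:
        summary = '; '.join(f'{app} ({why})' for app, why in missing_files)
        logger.warning('Unable to read metrics for %d app(s): %s',
                       len(missing_files), summary)
```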
return tools_processed_apps

# Create a map of App IDs to their execution engine type (Spark/Photon)
spark_version_key = 'spark.databricks.clusterUsageTags.sparkVersion'
Same here: this is platform-specific and should not be defined by the Qual tool.
    upperBound: 1000000.0
- columnName: 'Unsupported Operators Stage Duration Percent'
  lowerBound: 0.0
  upperBound: 25.0
This needs some thinking about the impact on the design.
It introduces platform-specific configuration inside the tool's conf, while on the other hand we already have a configuration file per platform.
@classmethod
def get_default(cls) -> 'ExecutionEngine':
    return cls.SPARK
Should we have a class defined per platform? All the platforms would be bound to the default enum type; then DB extends that definition, adding the Photon type. A sketch of the idea follows.
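A rough sketch of that per-platform idea; since Python enums with members cannot be subclassed, the Databricks variant below is a separate enum repeating the shared member (the names are illustrative, not the tool's actual classes):

```python
from enum import Enum

class ExecutionEngine(str, Enum):
    """Engines available on every platform."""
    SPARK = 'spark'

    @classmethod
    def get_default(cls) -> 'ExecutionEngine':
        return cls.SPARK

class DBExecutionEngine(str, Enum):
    """Databricks-specific engines: the shared default plus Photon."""
    SPARK = 'spark'
    PHOTON = 'photon'

    @classmethod
    def get_default(cls) -> 'DBExecutionEngine':
        return cls.SPARK
```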
@@ -349,3 +349,13 @@ def bytes_to_human_readable(cls, num_bytes: int) -> str:
        num_bytes /= 1024.0
        i += 1
    return f'{num_bytes:.2f} {size_units[i]}'

@classmethod
def convert_df_to_dict(cls, df: pd.DataFrame) -> dict:
- I don't think this can really be a util function, because it is very tailored to a specific use case. Perhaps if the method were more generic, returning a dictionary of {key -> {col1: val, col2: val, ...}}, then it would qualify as a utility function.
- If we want to keep it, we can consider moving this method to a separate util class/file, for example df_utils.py, to be the seed for any other helpers we need for the dataframes.
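For the generic shape described above, pandas already covers most of the work; a minimal sketch (the function name and key_col parameter are illustrative):

```python
import pandas as pd

def convert_df_to_dict(df: pd.DataFrame, key_col: str) -> dict:
    """Return {key -> {column: value, ...}} keyed by `key_col`."""
    return df.set_index(key_col).to_dict(orient='index')

df = pd.DataFrame({'appId': ['app-1', 'app-2'],
                   'executionEngine': ['photon', 'spark']})
print(convert_df_to_dict(df, 'appId'))
# {'app-1': {'executionEngine': 'photon'}, 'app-2': {'executionEngine': 'spark'}}
```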
{
"appId": "app-20240827220408-0000",
"appName": "Databricks Shell",
"cli": {
"executionEngine": "spark",
"estimatedGpuSpeedupCategory": "Small",
"clusterInfo": {
"platform": "databricks_aws",
"sourceCluster": {},
"recommendedCluster": {
"driverNodeType": "m6gd.xlarge",
"workerNodeType": "g5.2xlarge",
"numWorkerNodes": 2
}
}
},
"core": {
"eventLog": "file:/path/to/log",
"fullClusterConfigRecommendations": "/qual_xxx/rapids_4_spark_qualification_output/tuning/app-20240827220408-0000.conf",
"gpuConfigRecommendationBreakdown": "/qual_xxx/rapids_4_spark_qualification_output/tuning/app-20240827220408-0000.log"
}
}
From offline discussions with @amahussein and @leewyang: the detection of the runtime (Spark/Photon/Velox) is moving to Scala. This PR will be refactored afterwards.
@@ -41,3 +41,16 @@ Feature: Event Log Processing
      Qualification. Raised an error in phase [Execution]
      """
      And return code is "1"

  @test_id_ELP_0003
  Scenario Outline: Qualification tool processes event logs with different execution engine
nit: execution engines.
Q: What is the behavior if a user inputs --platform onprem but supplies a Photon event log? Does this fail early on the Python side?
Issue #251.
This PR introduces support for recommending Photon applications, using a separate strategy for categorizing them.
Additionally, the Small category for Photon applications is defined differently from that of Spark-based applications.
Note: Output executionEngine in app_metadata.json (see the JSON example above).
Changes

Enhancements and New Features:
- tool_ctxt.py: Introduced a new method get_metrics_output_folder to fetch the metrics output directory.
- qualification-conf.yaml: Updated configuration to include the new metrics subfolder and execution engine settings. [1] [2] [3] [4]
- enums.py: Added a new ExecutionEngine class to represent different execution engines.
- speedup_category.py: Introduced a SpeedupStrategy class and refactored methods to accommodate execution-engine-specific speedup strategies. [1] [2] [3] [4]

Refactoring and Utility Improvements:
- qualification.py: Added a helper method _read_qualification_metric_file to read metric files and _assign_execution_engine_to_apps to assign execution engines to applications.
- util.py: Added a utility method convert_df_to_dict to convert DataFrames to dictionaries.

Tests:
- event_log_processing.feature: Added new test scenarios to validate the execution engine assignment.
- e2e_utils.py and test_steps.py: Updated end-to-end test utilities to support new features. [1] [2] [3]

Follow Up

The following changes will be needed from QualX:
- Read spark_properties and thus use a specific model for the specific app type (a hedged sketch follows below).
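A purely illustrative sketch of that follow-up, assuming hypothetical model names and that QualX keys off the executionEngine field written to app_metadata.json; per the discussion above, the actual detection mechanism may move to Scala:

```python
import json

def select_qualx_model(app_meta_path: str) -> str:
    """Hypothetical QualX hook: pick a model based on the app's engine."""
    with open(app_meta_path, encoding='utf-8') as f:
        meta = json.load(f)
    engine = meta.get('cli', {}).get('executionEngine', 'spark')
    # 'photon_model' / 'spark_model' are placeholder names, not shipped models.
    return 'photon_model' if engine == 'photon' else 'spark_model'
```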