Add qualification support for Photon jobs in the Python Tool #1409

Draft
wants to merge 6 commits into dev

Conversation

@parthosa parthosa commented Nov 2, 2024

Issue #251.

This PR introduces support for recommending Photon applications, using a separate strategy for categorizing them:

  • Spark Execution Engine: Recommend apps with a speedup greater than 1.3x.
  • Photon Execution Engine: Recommend apps with a speedup greater than 1x.

Additionally, the Small category for Photon applications is different from that of Spark-based applications:

  • Spark Execution Engine: Apps with a speedup in the range of 1.3x to 2x are categorized as Small.
  • Photon Execution Engine: Apps with a speedup in the range of 1x to 2x are categorized as Small.
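
The thresholds above can be summarized as a per-engine lookup; the following is a minimal sketch for illustration only (class and variable names are assumptions; the actual logic lives in speedup_category.py and is driven by qualification-conf.yaml):

  # Hedged sketch: the threshold values mirror the description above, but the
  # class and variable names are illustrative, not the PR's actual API.
  from dataclasses import dataclass

  @dataclass(frozen=True)
  class SpeedupThresholds:
      recommend_min: float  # minimum speedup for an app to be recommended
      small_max: float      # upper bound of the 'Small' category

  SPEEDUP_STRATEGIES = {
      'spark': SpeedupThresholds(recommend_min=1.3, small_max=2.0),
      'photon': SpeedupThresholds(recommend_min=1.0, small_max=2.0),
  }

  def categorize(engine: str, speedup: float) -> str:
      thresholds = SPEEDUP_STRATEGIES[engine]
      if speedup <= thresholds.recommend_min:
          return 'Not Recommended'
      if speedup <= thresholds.small_max:
          return 'Small'
      return 'Medium/Large'  # further category splits omitted in this sketch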

Note

  • The speedup strategy is assigned on a per-app basis, enabling support for heterogeneous cases.
  • Hence, if a user provides both Photon and Spark event logs, the Python Tool applies a separate strategy to each app based on its execution engine (Spark or Photon).

Output

  • Since this is a metadata property, an executionEngine entry is included for each app in app_metadata.json:
  {
    "appId": "app-20240818062343-0000",
    "appName": "Databricks Shell",
    "eventLog": "file:/path/to/log/photon_eventlog",
    "executionEngine": "photon",
    "estimatedGpuSpeedupCategory": "Not Recommended"
  }

Changes

Enhancements and New Features:

  • tool_ctxt.py: Introduced a new method get_metrics_output_folder to fetch the metrics output directory.
  • qualification-conf.yaml: Updated configuration to include new metrics subfolder and execution engine settings. [1] [2] [3] [4]
  • enums.py: Added a new ExecutionEngine class to represent different execution engines.
  • speedup_category.py: Introduced SpeedupStrategy class and refactored methods to accommodate execution engine-specific speedup strategies. [1] [2] [3] [4]
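
For orientation, a minimal sketch of what the new ExecutionEngine enum might look like (get_default appears in a diff hunk further down; everything else here is an assumption based on the description):

  # Hedged sketch of the new enum; only get_default() is visible in this PR's
  # diff hunks, so the base class and member values here are assumptions.
  from enum import Enum

  class ExecutionEngine(str, Enum):
      SPARK = 'spark'
      PHOTON = 'photon'

      @classmethod
      def get_default(cls) -> 'ExecutionEngine':
          return cls.SPARK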

Refactoring and Utility Improvements:

  • qualification.py: Added a helper method _read_qualification_metric_file to read metric files and _assign_execution_engine_to_apps to assign execution engines to applications.
  • util.py: Added a utility method convert_df_to_dict to convert DataFrames to dictionaries.
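
A hedged sketch of what the metric-reading helper could look like (the signature, file layout, and error handling here are assumptions based on the description and the diff snippets below; the real method is _read_qualification_metric_file):

  # Hedged sketch assuming one subfolder per application under the metrics
  # output directory; function name and path layout are assumptions.
  import os
  import pandas as pd

  def read_qualification_metric_file(root_metric_dir: str, file_name: str) -> dict:
      """Return a map of appId -> DataFrame for the given metric file name."""
      metrics = {}
      for app_id in os.listdir(root_metric_dir):
          app_dir = os.path.join(root_metric_dir, app_id)
          if not os.path.isdir(app_dir):
              continue
          try:
              metrics[app_id] = pd.read_csv(os.path.join(app_dir, file_name))
          except Exception:
              # Keep an empty entry so downstream code avoids KeyErrors,
              # matching the fallback visible in the review snippet below.
              metrics[app_id] = pd.DataFrame()
      return metrics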

Tests:

  • event_log_processing.feature: Added new test scenarios to validate the execution engine assignment.
  • e2e_utils.py and test_steps.py: Updated end-to-end test utilities to support new features. [1] [2] [3]

Follow Up

The following changes will be needed from QualX:

  1. Add Photon-specific models.
  2. Identify whether an app is Photon based on its spark_properties, and use the appropriate model for each app type.

@parthosa parthosa added feature request New feature or request user_tools Scope the wrapper module running CSP, QualX, and reports (python) labels Nov 2, 2024
@parthosa parthosa self-assigned this Nov 2, 2024
Signed-off-by: Partho Sarthi <[email protected]>
@parthosa parthosa marked this pull request as ready for review November 4, 2024 20:19
@parthosa parthosa added the affect-output A change that modifies the output (add/remove/rename files, add/remove/rename columns) label Nov 4, 2024

@amahussein amahussein left a comment

Thanks @parthosa !
Just for the sake of confirmation:

  • Is there another follow-up PR to change the QualX module to read app_meta.json and decide whether an app is Photon or not? In that case the PR description is not accurate, because it gives the impression that this PR adds end-to-end support.
  • I am concerned about how we can troubleshoot and validate app_meta.json. The wrapper reads the autotuner's output and copies some of the fields into that file at the upper level. With this PR, we are adding a new field derived from Python logic. Later, we will hit the question "Where does each field come from?" (this becomes even more challenging if fields can be overridden by the Python wrapper). CC: @tgravescs

Comment on lines +651 to +652
if self.ctxt.platform.get_platform_name() not in {CspEnv.DATABRICKS_AWS, CspEnv.DATABRICKS_AZURE}:
tools_processed_apps[exec_engine_col_name] = default_exec_engine_type

This could be a function in the platform class; DB-AWS/DB-Azure can override it.
Later, we might have some logic to apply for onprem.
For example, if a customer runs a custom Spark (onprem), we will need a way to specify this "executionEngine".
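
For instance, a hedged sketch of such a platform hook, with Databricks platforms overriding a Spark default (class and method names here are illustrative, not the repo's actual classes):

# Illustrative sketch only: platform-specific detection of the execution engine.
class PlatformBase:
    def get_execution_engine(self, app_properties: dict) -> str:
        # Default for all platforms (including onprem) until a better signal exists.
        return 'spark'

class DatabricksPlatform(PlatformBase):
    def get_execution_engine(self, app_properties: dict) -> str:
        # Databricks exposes the runtime version via its cluster usage tags.
        spark_version = app_properties.get(
            'spark.databricks.clusterUsageTags.sparkVersion', '')
        return 'photon' if 'photon' in spark_version.lower() else 'spark'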

:param file_name: Name of the metric file to read from each application's folder
"""
metrics = {}
root_metric_dir = self.ctxt.get_metrics_output_folder()

nit: Rather than using the legacy FSUtil, it might help to write this method using the storageLib.
This ensures that we do not have to revisit this method when we support distributed output.
I don't have a strong opinion on changing the implementation though.

# that the dictionary contains entries for all apps to avoid KeyErrors
# and maintain consistency in processing.
metrics[app_id_str] = pd.DataFrame()
self.logger.warning('Unable to read metrics file for app %s. Reason - %s:%s',

nit: At some point we need to find a better way to log these warning messages. When I run the tools, there are a ton of messages triggered by missing some of the profiler's output.
We could improve that in a separate issue: buffer all the missing files and dump a single warning message.

return tools_processed_apps

# Create a map of App IDs to their execution engine type (Spark/Photon)
spark_version_key = 'spark.databricks.clusterUsageTags.sparkVersion'

Same here: this is platform-specific and should not be defined by the Qual tool.

upperBound: 1000000.0
- columnName: 'Unsupported Operators Stage Duration Percent'
lowerBound: 0.0
upperBound: 25.0

This needs some thinking about the design impact.
It introduces a platform-specific configuration inside the tool's conf; on the other hand, we do have a configuration file per platform.


@classmethod
def get_default(cls) -> 'ExecutionEngine':
return cls.SPARK

Should we have a class defined per platform? All platforms would be bound to the default enum type, and then DB would extend that definition by adding the Photon type.

@@ -349,3 +349,13 @@ def bytes_to_human_readable(cls, num_bytes: int) -> str:
num_bytes /= 1024.0
i += 1
return f'{num_bytes:.2f} {size_units[i]}'

@classmethod
def convert_df_to_dict(cls, df: pd.DataFrame) -> dict:

  • I don't think this can really be a util function because it is very tailored to a specific use-case. Perhaps if the method were more generic and returned a dictionary of {key -> {col1: val, col2: val, ...}}, it would qualify as a utility function (see the sketch below).
  • If we want to keep it, we can consider moving this method to a separate util class/file, for example df_utils.py, as the seed for any other DataFrame helpers we need.
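
For reference, a hedged sketch of the more generic shape suggested in the first bullet (the function name and the key-column argument are assumptions):

# Hypothetical generic helper: index a DataFrame by one column and return
# a nested dict of {key -> {column: value, ...}} for the remaining columns.
import pandas as pd

def df_to_nested_dict(df: pd.DataFrame, key_col: str) -> dict:
    return df.set_index(key_col).to_dict(orient='index')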

parthosa commented Nov 6, 2024

Is there another follow-up PR to change the QualX module to read app_meta.json and decide whether an app is Photon or not?

  • There will be a PR from QualX with these changes: (1) add Photon-specific models, (2) identify whether an app is Photon based on its spark_properties and use the appropriate model for each app type.
  • Updated the PR description with the follow-up.

I am concerned about how we can troubleshoot and validate app_meta.json. The wrapper reads the autotuner's output and copies some of the fields into that file at the upper level. With this PR, we are adding a new field derived from Python logic. Later, we will hit the question "Where does each field come from?" (this becomes even more challenging if fields can be overridden by the Python wrapper).

  • Agreed. To avoid this confusion, we could introduce a nesting level to indicate where each field comes from. Something like:
  {
    "appId": "app-20240827220408-0000",
    "appName": "Databricks Shell",
    "cli": {
      "executionEngine": "spark",
      "estimatedGpuSpeedupCategory": "Small",
      "clusterInfo": {
        "platform": "databricks_aws",
        "sourceCluster": {},
        "recommendedCluster": {
          "driverNodeType": "m6gd.xlarge",
          "workerNodeType": "g5.2xlarge",
          "numWorkerNodes": 2
        }
      }
    },
    "core": {
      "eventLog": "file:/path/to/log",
      "fullClusterConfigRecommendations": "/qual_xxx/rapids_4_spark_qualification_output/tuning/app-20240827220408-0000.conf",
      "gpuConfigRecommendationBreakdown": "/qual_xxx/rapids_4_spark_qualification_output/tuning/app-20240827220408-0000.log"
    }
  }

parthosa commented Nov 6, 2024

Based on offline discussions with @amahussein and @leewyang, the detection of the runtime (Spark/Photon/Velox) is being moved to Scala.

This PR will be refactored afterwards.

@parthosa parthosa marked this pull request as draft November 6, 2024 23:12
@@ -41,3 +41,16 @@ Feature: Event Log Processing
Qualification. Raised an error in phase [Execution]
"""
And return code is "1"

@test_id_ELP_0003
Scenario Outline: Qualification tool processes event logs with different execution engine

nit: execution engines

Q: What is the behavior if a user passes --platform onprem but provides a Photon event log? Does this fail early on the Python side?
