Add qualification support for Photon jobs in the Python Tool #1409

Draft
wants to merge 6 commits into dev

Conversation

@parthosa parthosa commented Nov 2, 2024

Issue #251.

This PR introduces support for recommending Photon applications, using a separate strategy for categorizing them:

  • Spark Execution Engine: Recommend apps with a speedup greater than 1.3x.
  • Photon Execution Engine: Recommend apps with a speedup greater than 1x.

Additionally, the Small category for Photon applications is different from that of Spark-based applications:

  • Spark Execution Engine: Apps with a speedup in the range of 1.3x to 2x are categorized as Small.
  • Photon Execution Engine: Apps with a speedup in the range of 1x to 2x are categorized as Small.
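
The thresholds above can be summarized as a per-engine lookup; the following is a minimal sketch for illustration only (class and variable names are assumptions; the actual logic lives in speedup_category.py and is driven by qualification-conf.yaml):

  # Hedged sketch: the threshold values mirror the description above, but the
  # class and variable names are illustrative, not the PR's actual API.
  from dataclasses import dataclass

  @dataclass(frozen=True)
  class SpeedupThresholds:
      recommend_min: float  # minimum speedup for an app to be recommended
      small_max: float      # upper bound of the 'Small' category

  SPEEDUP_STRATEGIES = {
      'spark': SpeedupThresholds(recommend_min=1.3, small_max=2.0),
      'photon': SpeedupThresholds(recommend_min=1.0, small_max=2.0),
  }

  def categorize(engine: str, speedup: float) -> str:
      thresholds = SPEEDUP_STRATEGIES[engine]
      if speedup <= thresholds.recommend_min:
          return 'Not Recommended'
      if speedup <= thresholds.small_max:
          return 'Small'
      return 'Medium/Large'  # further category splits omitted in this sketch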

Note

  • The speedup strategy is assigned on a per-app basis, enabling support for heterogeneous cases.
  • Hence, if a user provides both Photon and Spark event logs, the Python Tool applies a separate strategy to each app based on its execution engine (Spark or Photon).

Output

  • Since this is a metadata property, an executionEngine entry is included for each app in app_metadata.json:
  {
    "appId": "app-20240818062343-0000",
    "appName": "Databricks Shell",
    "eventLog": "file:/path/to/log/photon_eventlog",
    "executionEngine": "photon",
    "estimatedGpuSpeedupCategory": "Not Recommended"
  }

Changes

Enhancements and New Features:

  • tool_ctxt.py: Introduced a new method get_metrics_output_folder to fetch the metrics output directory.
  • qualification-conf.yaml: Updated configuration to include new metrics subfolder and execution engine settings. [1] [2] [3] [4]
  • enums.py: Added a new ExecutionEngine class to represent different execution engines.
  • speedup_category.py: Introduced SpeedupStrategy class and refactored methods to accommodate execution engine-specific speedup strategies. [1] [2] [3] [4]
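
For orientation, a minimal sketch of what the new ExecutionEngine enum might look like (get_default appears in a diff hunk further down; everything else here is an assumption based on the description):

  # Hedged sketch of the new enum; only get_default() is visible in this PR's
  # diff hunks, so the base class and member values here are assumptions.
  from enum import Enum

  class ExecutionEngine(str, Enum):
      SPARK = 'spark'
      PHOTON = 'photon'

      @classmethod
      def get_default(cls) -> 'ExecutionEngine':
          return cls.SPARK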

Refactoring and Utility Improvements:

  • qualification.py: Added a helper method _read_qualification_metric_file to read metric files and _assign_execution_engine_to_apps to assign execution engines to applications.
  • util.py: Added a utility method convert_df_to_dict to convert DataFrames to dictionaries.
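
A hedged sketch of what the metric-reading helper could look like (the signature, file layout, and error handling here are assumptions based on the description and the diff snippets below; the real method is _read_qualification_metric_file):

  # Hedged sketch assuming one subfolder per application under the metrics
  # output directory; function name and path layout are assumptions.
  import os
  import pandas as pd

  def read_qualification_metric_file(root_metric_dir: str, file_name: str) -> dict:
      """Return a map of appId -> DataFrame for the given metric file name."""
      metrics = {}
      for app_id in os.listdir(root_metric_dir):
          app_dir = os.path.join(root_metric_dir, app_id)
          if not os.path.isdir(app_dir):
              continue
          try:
              metrics[app_id] = pd.read_csv(os.path.join(app_dir, file_name))
          except Exception:
              # Keep an empty entry so downstream code avoids KeyErrors,
              # matching the fallback visible in the review snippet below.
              metrics[app_id] = pd.DataFrame()
      return metrics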

Tests:

  • event_log_processing.feature: Added new test scenarios to validate the execution engine assignment.
  • e2e_utils.py and test_steps.py: Updated end-to-end test utilities to support new features. [1] [2] [3]

Follow Up

The following changes will be needed from QualX:

  1. Add Photon-specific models.
  2. Identify whether an app is Photon based on its spark_properties, and use the appropriate model for each app type.

@parthosa parthosa added feature request New feature or request user_tools Scope the wrapper module running CSP, QualX, and reports (python) labels Nov 2, 2024
@parthosa parthosa self-assigned this Nov 2, 2024
Signed-off-by: Partho Sarthi <[email protected]>
@parthosa parthosa marked this pull request as ready for review November 4, 2024 20:19
@parthosa parthosa added the affect-output A change that modifies the output (add/remove/rename files, add/remove/rename columns) label Nov 4, 2024

@amahussein amahussein left a comment

Thanks @parthosa !
Just for the sake of confirmation:

  • Is there another follow-up PR to change the QualX module to read app_meta.json and decide whether an app is Photon or not? In that case the PR description is not accurate, because it gives the impression that this PR adds end-to-end support.
  • I am concerned about how we can troubleshoot and validate app_meta.json. The wrapper reads the autotuner's output and copies some of the fields into that file at the upper level. With this PR, we are adding a new field derived from Python logic. Later, we will hit the question "Where does each field come from?" (this becomes even more challenging if fields can be overridden by the Python wrapper). CC: @tgravescs

Comment on lines +651 to +652
if self.ctxt.platform.get_platform_name() not in {CspEnv.DATABRICKS_AWS, CspEnv.DATABRICKS_AZURE}:
tools_processed_apps[exec_engine_col_name] = default_exec_engine_type

This could be a function in the platform class; DB-AWS/DB-Azure can override it.
Later, we might have some logic to apply for onprem.
For example, if a customer runs a custom Spark (onprem), we will need a way to specify this "executionEngine".
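
For instance, a hedged sketch of such a platform hook, with Databricks platforms overriding a Spark default (class and method names here are illustrative, not the repo's actual classes):

# Illustrative sketch only: platform-specific detection of the execution engine.
class PlatformBase:
    def get_execution_engine(self, app_properties: dict) -> str:
        # Default for all platforms (including onprem) until a better signal exists.
        return 'spark'

class DatabricksPlatform(PlatformBase):
    def get_execution_engine(self, app_properties: dict) -> str:
        # Databricks exposes the runtime version via its cluster usage tags.
        spark_version = app_properties.get(
            'spark.databricks.clusterUsageTags.sparkVersion', '')
        return 'photon' if 'photon' in spark_version.lower() else 'spark'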

:param file_name: Name of the metric file to read from each application's folder
"""
metrics = {}
root_metric_dir = self.ctxt.get_metrics_output_folder()

nit: Rather than using the legacy FSUtil, it might help to write this method using the storageLib.
This ensures that we do not have to revisit this method when we support distributed output.
I don't have a strong opinion on changing the implementation though.

# that the dictionary contains entries for all apps to avoid KeyErrors
# and maintain consistency in processing.
metrics[app_id_str] = pd.DataFrame()
self.logger.warning('Unable to read metrics file for app %s. Reason - %s:%s',

nit: At some point we need to find a better way to log these warning messages. When I run the tools, there are a ton of messages triggered by missing some of the profiler's output.
We could improve that in a separate issue: buffer all the missing files and dump a single warning message.

return tools_processed_apps

# Create a map of App IDs to their execution engine type (Spark/Photon)
spark_version_key = 'spark.databricks.clusterUsageTags.sparkVersion'

Same here: this is platform-specific and should not be defined by the Qual tool.

upperBound: 1000000.0
- columnName: 'Unsupported Operators Stage Duration Percent'
lowerBound: 0.0
upperBound: 25.0

This needs some thinking about the design impact.
It introduces a platform-specific configuration inside the tool's conf; on the other hand, we do have a configuration file per platform.


@classmethod
def get_default(cls) -> 'ExecutionEngine':
return cls.SPARK

Should we have a class defined per platform? All platforms would be bound to the default enum type, and then DB would extend that definition by adding the Photon type.

@@ -349,3 +349,13 @@ def bytes_to_human_readable(cls, num_bytes: int) -> str:
num_bytes /= 1024.0
i += 1
return f'{num_bytes:.2f} {size_units[i]}'

@classmethod
def convert_df_to_dict(cls, df: pd.DataFrame) -> dict:

  • I don't think this can really be a util function because it is very tailored to a specific use-case. Perhaps if the method were more generic and returned a dictionary of {key -> {col1: val, col2: val, ...}}, it would qualify as a utility function (see the sketch below).
  • If we want to keep it, we can consider moving this method to a separate util class/file, for example df_utils.py, as the seed for any other DataFrame helpers we need.
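
For reference, a hedged sketch of the more generic shape suggested in the first bullet (the function name and the key-column argument are assumptions):

# Hypothetical generic helper: index a DataFrame by one column and return
# a nested dict of {key -> {column: value, ...}} for the remaining columns.
import pandas as pd

def df_to_nested_dict(df: pd.DataFrame, key_col: str) -> dict:
    return df.set_index(key_col).to_dict(orient='index')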

parthosa commented Nov 6, 2024

Is there another follow-up PR to change the QualX module to read app_meta.json and decide whether an app is Photon or not?

  • There will be a PR from QualX with these changes: (1) add Photon-specific models, (2) identify whether an app is Photon based on its spark_properties and use the appropriate model for each app type.
  • Updated the PR description with the follow-up.

I am concerned about how we can troubleshoot and validate app_meta.json. The wrapper reads the autotuner's output and copies some of the fields into that file at the upper level. With this PR, we are adding a new field derived from Python logic. Later, we will hit the question "Where does each field come from?" (this becomes even more challenging if fields can be overridden by the Python wrapper).

  • Agreed. To avoid this confusion, we could introduce a nesting level to indicate where each field comes from. Something like:
  {
    "appId": "app-20240827220408-0000",
    "appName": "Databricks Shell",
    "cli": {
      "executionEngine": "spark",
      "estimatedGpuSpeedupCategory": "Small",
      "clusterInfo": {
        "platform": "databricks_aws",
        "sourceCluster": {},
        "recommendedCluster": {
          "driverNodeType": "m6gd.xlarge",
          "workerNodeType": "g5.2xlarge",
          "numWorkerNodes": 2
        }
      }
    },
    "core": {
      "eventLog": "file:/path/to/log",
      "fullClusterConfigRecommendations": "/qual_xxx/rapids_4_spark_qualification_output/tuning/app-20240827220408-0000.conf",
      "gpuConfigRecommendationBreakdown": "/qual_xxx/rapids_4_spark_qualification_output/tuning/app-20240827220408-0000.log"
    }
  }

parthosa commented Nov 6, 2024

Based on offline discussions with @amahussein and @leewyang, the detection of the runtime (Spark/Photon/Velox) is being moved to Scala.

This PR will be refactored afterwards.

@parthosa parthosa marked this pull request as draft November 6, 2024 23:12
@@ -41,3 +41,16 @@ Feature: Event Log Processing
Qualification. Raised an error in phase [Execution]
"""
And return code is "1"

@test_id_ELP_0003
Scenario Outline: Qualification tool processes event logs with different execution engine

nit: execution engines

Q: What is the behavior if a user passes --platform onprem but provides a Photon event log? Does this fail early on the Python side?
