[FEA] Allow users to specify custom Dependency jars #1395

Merged
merged 6 commits into NVIDIA:dev from rapids-tools-1359-part3 on Nov 4, 2024

Conversation

amahussein
Collaborator

@amahussein amahussein commented Oct 24, 2024

Signed-off-by: Ahmed Hussein [email protected]

Fixes #1359

Add a new input argument that takes a path to a YAML file: `--tools_config_file`. The config file allows users to define their own binaries that need to be added to the classpath of the tools jar command. This change is important because it lets users run the user-tools wrapper with their own custom Spark.

Sample usage and specifications

  • The dependencies can be `https:` or `file:///` URIs

  • Assume that the Spark binaries are for spark-3.3.1-bin-hadoop3 and exist on local disk at /local/path/to/spark-3.3.1-bin-hadoop3.tgz

  • We can define the config-file /home/tools-conf.yaml as follows:

    info:
      api_version: '1.0'
    runtime:
      dependencies:
        - name: my-spark331
          uri: file:///local/path/to/spark-3.3.1-bin-hadoop3.tgz
          dependency_type:
            dep_type: archive
            # for tgz files, it is required to give the subfolder where the jars are located
            relative_path: jars/*
  • Now, we can run the Python command as:

    spark_rapids profiling \
      --tools_config_file /home/tools-conf.yaml \
      --output_folder $LOCAL_OUTPUT_DIR \
      --eventlogs $LOCAL_GPU_EVENT_LOGS \
      --tools_jar $LOCAL_JAR \
      --verbose
    
  • If more jar files are needed on the classpath, the user can define an array of dependencies (a sketch for computing the verification hash values follows this example):

    info:
      api_version: '1.0'
    runtime:
      dependencies:
        - name: Hadoop AWS
          uri: https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar
          verification:
            size: 962685
        - name: AWS Java SDK Bundled
          uri: https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar
          verification:
            file_hash:
              algorithm: sha1
              value: 02deec3a0ad83d13d032b1812421b23d7a961eea
        - name: spark-3.5.0
          uri: https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
          verification:
            file_hash:
              algorithm: sha512
              value: 8883c67e0a138069e597f3e7d4edbbd5c3a565d50b28644aad02856a1ec1da7cb92b8f80454ca427118f69459ea326eaa073cf7b1a860c3b796f4b07c2101319
          dependency_type:
            dep_type: archive
            relative_path: jars/*
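
The hash values used in the verification blocks above can be computed locally before writing the config file. A minimal sketch (not part of the tools themselves; compute_file_hash is a hypothetical helper and only Python's standard hashlib is assumed):

    import hashlib

    def compute_file_hash(path: str, algorithm: str = 'sha512', chunk_size: int = 1 << 20) -> str:
        # Hex digest of a local file, e.g. the value for verification.file_hash.value
        hasher = hashlib.new(algorithm)
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                hasher.update(chunk)
        return hasher.hexdigest()

    # Example: compute the sha512 value for a locally downloaded Spark archive
    print(compute_file_hash('/local/path/to/spark-3.5.0-bin-hadoop3.tgz', 'sha512'))
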
The specification of the file is as follows:

---
"$defs":
  DependencyType:
    description: Represents the dependency type for the jar cmd
    enum:
    - jar
    - archive
    - classpath
    title: DependencyType
    type: string
  DependencyVerification:
    description: The verification information of the dependency.
    properties:
      size:
        default: 0
        description: The size of the dependency file.
        examples:
        - 3265393
        title: Size
        type: integer
      file_hash:
        "$ref": "#/$defs/FileHashAlgorithm"
        default: 
        description: The hash function to verify the file
        examples:
        - algorithm: md5
          value: bc9bf7fedde0e700b974426fbd8d869c
    title: DependencyVerification
    type: object
  FileHashAlgorithm:
    description: |-
      Represents a file hash algorithm and its value. Used for verification against an
      existing file.
      ```py
      try:
          file_algo = FileHashAlgorithm(algorithm=HashAlgorithm.SHA256, value='...')
          file_algo.verify_file(CspPath('file://path/to/file'))
      except ValidationError as e:
          print(e)
      ```
    properties:
      algorithm:
        "$ref": "#/$defs/HashAlgorithm"
      value:
        title: Value
        type: string
    required:
    - algorithm
    - value
    title: FileHashAlgorithm
    type: object
  HashAlgorithm:
    description: Represents the supported hashing algorithms
    enum:
    - md5
    - sha1
    - sha256
    - sha512
    title: HashAlgorithm
    type: string
  RuntimeDependency:
    description: |-
      The runtime dependency required by the tools Jar cmd. All elements are downloaded and added
      to the classPath.
    properties:
      name:
        description: The name of the dependency
        title: Name
        type: string
      uri:
        anyOf:
        - format: uri
          minLength: 1
          type: string
        - format: file-path
          type: string
        description: The FileURI of the dependency. It can be a local file or a remote
          file
        examples:
        - file:///path/to/file.jar
        - https://mvn-url/24.08.1/rapids-4-spark-tools_2.12-24.08.1.jar
        - gs://bucket-name/path/to/file.jar
        title: Uri
      dependency_type:
        "$ref": "#/$defs/RuntimeDependencyType"
        description: The type of the dependency and how to find the lib files after
          decompression.
      verification:
        "$ref": "#/$defs/DependencyVerification"
        default: 
        examples:
        - size: 3265393
        - fileHash:
            algorithm: md5
            value: bc9bf7fedde0e700b974426fbd8d869c
    required:
    - name
    - uri
    title: RuntimeDependency
    type: object
  RuntimeDependencyType:
    description: |-
      Defines the type of dependency. It can be one of the following:
         - Archived file (.tgz)
         - Simple JAR file (*.jar)
         - Classpath directory (not yet supported)

      Note: The 'classpath' type is reserved for future use, allowing users to point to a directory
      in the classpath without needing to download or copy any binaries.
    properties:
      dep_type:
        "$ref": "#/$defs/DependencyType"
        description: The type of the dependency
      relative_path:
        default: 
        description: The relative path of the dependency in the classpath. This is
          relevant for tar files
        examples:
        - jars/*
        title: Relative Path
        type: string
    required:
    - dep_type
    title: RuntimeDependencyType
    type: object
  ToolsConfigInfo:
    description: |-
      High level metadata about the tools configurations (i.e., the api-version and any other relevant
      metadata).
    properties:
      api_version:
        description: The version of the API that the tools are using. This is used
          to test the compatibility of the configuration file against the current
          tools release
        examples:
        - '1.0'
        maximum: 1
        minimum: 1
        title: Api Version
        type: number
    required:
    - api_version
    title: ToolsConfigInfo
    type: object
  ToolsRuntimeConfig:
    description: The runtime configurations of the tools as defined by the user.
    properties:
      dependencies:
        description: The list of runtime dependencies required by the tools Jar cmd.
          All elements are downloaded and added to the classPath
        items:
          "$ref": "#/$defs/RuntimeDependency"
        title: Dependencies
        type: array
    required:
    - dependencies
    title: ToolsRuntimeConfig
    type: object
description: Main container for the user's defined tools configuration
properties:
  info:
    "$ref": "#/$defs/ToolsConfigInfo"
    description: Metadata information of the tools configuration
  runtime:
    "$ref": "#/$defs/ToolsRuntimeConfig"
    description: Configuration related to the runtime environment of the tools
required:
- info
- runtime
title: ToolsConfig
type: object
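
For orientation, the schema above is generated from pydantic models. A minimal sketch of how such a config file could be loaded and validated (the models here simply mirror the schema and are illustrative, not copied from the PR's implementation):

    from typing import List, Optional

    import yaml  # assumes PyYAML is available
    from pydantic import BaseModel, Field, ValidationError

    # Illustrative models mirroring the schema above (not the PR's actual classes)
    class ToolsConfigInfo(BaseModel):
        api_version: float = Field(ge=1.0, le=1.0)

    class RuntimeDependency(BaseModel):
        name: str
        uri: str
        dependency_type: Optional[dict] = None
        verification: Optional[dict] = None

    class ToolsRuntimeConfig(BaseModel):
        dependencies: List[RuntimeDependency]

    class ToolsConfig(BaseModel):
        info: ToolsConfigInfo
        runtime: ToolsRuntimeConfig

    def load_tools_config(path: str) -> ToolsConfig:
        # Parse the YAML file and validate it against the models above
        with open(path, encoding='utf-8') as f:
            return ToolsConfig.model_validate(yaml.safe_load(f))

    try:
        cfg = load_tools_config('/home/tools-conf.yaml')
        print([dep.name for dep in cfg.runtime.dependencies])
    except ValidationError as err:
        print(err)
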

Detailed code changes

This pull request includes several changes to the user_tools module, focusing on improving dependency management and configuration handling. The changes include importing new modules, adding methods for handling tool configurations, and updating the way dependencies are processed and verified.
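
A rough sketch of what such a verification step could look like (verify_dependency is hypothetical; only the size and file_hash fields from the schema above are assumed):

    import hashlib
    import os
    from typing import Optional

    def verify_dependency(path: str, size: int = 0,
                          algorithm: Optional[str] = None,
                          value: Optional[str] = None) -> bool:
        # Check a downloaded dependency against the optional size and file_hash
        # fields from the config file; skip whichever checks are not provided.
        if size and os.path.getsize(path) != size:
            return False
        if algorithm and value:
            hasher = hashlib.new(algorithm)
            with open(path, 'rb') as f:
                for chunk in iter(lambda: f.read(1 << 20), b''):
                    hasher.update(chunk)
            if hasher.hexdigest() != value:
                return False
        return True
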

Configuration Handling Enhancements:

Minor Code Improvements:

Modified the format of platform configuration to match the same structure

Signed-off-by: Ahmed Hussein <[email protected]>

Fixes NVIDIA#1359

Add a new input argument that takes a path to a yaml file `--tools_config_file`
The config file allows the users to define their own binaries that need
to be added to the classpath of the tools jar cmd.
This change is important because users can use the user-tools wrapper
with their custom spark.
@amahussein amahussein added feature request New feature or request user_tools Scope the wrapper module running CSP, QualX, and reports (python) usability track issues related to the Tools's user experience labels Oct 24, 2024
@amahussein amahussein self-assigned this Oct 24, 2024
Signed-off-by: Ahmed Hussein <[email protected]>
@amahussein amahussein changed the title Allow users to specify custom Dependency jars [FEA] Allow users to specify custom Dependency jars Oct 24, 2024
Collaborator

@parthosa parthosa left a comment

Thanks @amahussein for this PR. Adding a module for configurations using a pydantic model was interesting. Minor comments.

Signed-off-by: Ahmed Hussein <[email protected]>
parthosa
parthosa previously approved these changes Oct 25, 2024
Collaborator

@parthosa parthosa left a comment

Thanks @amahussein. LGTM. The configuration module could be very useful in the future for setting runtime properties (e.g., thresholds for TCV, distributed Spark configs, etc.).

nartal1
nartal1 previously approved these changes Oct 25, 2024
Collaborator

@nartal1 nartal1 left a comment

Thanks @amahussein! LGTM.

cindyyuanjiang
cindyyuanjiang previously approved these changes Oct 27, 2024
Collaborator

@cindyyuanjiang cindyyuanjiang left a comment

Thanks @amahussein! Very minor nits.

@tgravescs
Collaborator

tgravescs commented Oct 28, 2024

One general comment about the description: the section "The specification of the file is as follows" doesn't seem to match what the real spec is. Or at least it's very hard to read and match them up. Was this auto-generated?

For instance:

DependencyType:
    description: Represents the dependency type for the jar cmd
    enum:
    - jar
    - archive
    - classpath
    title: DependencyType
    type: string

The description is "dependency type for the jar command" (what is the jar command here?), but the enum is jar, archive, or classpath. It doesn't seem to match the dependency_type field in the example above it.

dependency_type:
  dep_type: archive
  relative_path: jars/*

I assume some fields are optional, e.g., dependency_type is only listed in your example for one of the 3 dependencies?

Also, the description doesn't detail how this interacts with the default Spark downloads; please add those details.

@tgravescs
Collaborator

When I run with a tools config file that contains only the info and api version, it fails with:

Traceback (most recent call last):
....
TypeError: argument 'message_template': 'ValidationError' object cannot be converted to 'PyString'

Is there a way to get a more informative error message?
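
One possible way to surface a clearer message (a sketch only; it relies on pydantic's public error API rather than the project's actual logging path, and describe_validation_error is a hypothetical helper) is to flatten the ValidationError into field-level strings before it reaches any message template:

    from pydantic import ValidationError

    def describe_validation_error(err: ValidationError) -> str:
        # Each entry names the failing field path and the reason, as plain strings
        lines = ['{}: {}'.format('.'.join(str(p) for p in e['loc']), e['msg'])
                 for e in err.errors()]
        return 'Invalid tools config file:\n  ' + '\n  '.join(lines)
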


class ToolsConfig(BaseModel):
    """Main container for the user's defined tools configuration"""
    info: ToolsConfigInfo = Field(
Collaborator

info:
  api_version: '1.0'

Is there a reason the api version is under info here vs just being at the top level?
I would have expected it at the top level so that all the fields could be based on this version field (including info).

Collaborator Author

You are right. The standard is for the api-version to be at the root level.
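
For illustration only, a root-level layout along these lines might look roughly as follows (the exact field set after the change is defined by the updated spec linked later in this thread, not by this sketch):

    api_version: '1.0'
    runtime:
      dependencies:
        - name: my-spark331
          uri: file:///local/path/to/spark-3.3.1-bin-hadoop3.tgz
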



class RuntimeDependency(BaseModel):
    """The runtime dependency required by the tools Jar cmd. All elements are downloaded and added
Collaborator

nit: "Jar" in other places isn't capitalized. I now get what the "jar cmd" is here; it's referring to calling into the java command from the python side. I wonder if using "java" instead of "jar" would be more obvious to the user.

Collaborator Author

Good idea! I am replacing "jar" with "Java".

@@ -370,6 +372,23 @@ def process_jvm_args(self) -> None:
        self.p_args['toolArgs']['jobResources'] = adjusted_resources
        self.p_args['toolArgs']['log4jPath'] = Utils.resource_path('dev/log4j.properties')

    def process_tools_config(self) -> None:
        """
        Load the tools config file if it is provided. it creates a ToolsConfig object and sets it
Collaborator

nit: "it" should be capitalized. Perhaps rename to something like load_tools_config or set_tools_config so it's not confusing that this is not "processing the actual dependencies".

        validation_alias=AliasChoices('dep_type', 'depType'))
    relative_path: str = Field(
        default=None,
        description='The relative path of the dependency in the classpath. This is relevant for tar files',
Collaborator

"This is relevant for tar files": does this mean I can't use .gz files, or just .jar files?

Collaborator Author

I will clarify the description. This field tells the tools in which subdirectory to find the additional jars. Therefore, it is only relevant for an "archive" dependency like spark.tgz, because extracting it produces an entire directory.
If the dependency is a .jar file, such information is not needed because the jar file is added to the classpath directly.
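
To illustrate the distinction (a sketch reusing dependencies from the PR description above, not an exhaustive spec):

    dependencies:
      # plain jar: added to the classpath as-is, no dependency_type or relative_path needed
      - name: Hadoop AWS
        uri: https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar
      # archive: extracted first, so relative_path points at the jars inside the unpacked directory
      - name: spark-3.5.0
        uri: https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
        dependency_type:
          dep_type: archive
          relative_path: jars/*
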

@amahussein
Collaborator Author

One general comment about the description: the section "The specification of the file is as follows" doesn't seem to match what the real spec is. Or at least it's very hard to read and match them up. Was this auto-generated?

For instance:

DependencyType:
    description: Represents the dependency type for the jar cmd
    enum:
    - jar
    - archive
    - classpath
    title: DependencyType
    type: string

The description is "dependency type for the jar command" (what is the jar command here?), but the enum is jar, archive, or classpath. It doesn't seem to match the dependency_type field in the example above it.

dependency_type:
  dep_type: archive
  relative_path: jars/*

I assume some fields are optional, e.g., dependency_type is only listed in your example for one of the 3 dependencies?

Also, the description doesn't detail how this interacts with the default Spark downloads; please add those details.

Thanks @tgravescs

I think the issue here is that the autogenerated file picks up both the field description and the class pydoc, which makes it confusing.
There is some control over the autogenerated file by excluding fields, adding examples, etc. I will revisit it to produce a better explanation of the schema.
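
As a rough illustration of that kind of control (a sketch assuming pydantic v2; the class and fields shown are illustrative, not the PR's actual code), the generated schema picks up both the class docstring and the per-field descriptions and examples:

    from pydantic import BaseModel, Field

    class RuntimeDependency(BaseModel):
        """The runtime dependency required by the tools java cmd."""
        name: str = Field(description='The name of the dependency')
        uri: str = Field(
            description='The FileURI of the dependency. It can be a local file or a remote file',
            examples=['file:///path/to/file.jar'])

    # Both the docstring and the Field descriptions/examples end up in the schema output,
    # which is why the autogenerated spec can read oddly when the two overlap.
    print(RuntimeDependency.model_json_schema())
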

amahussein and others added 2 commits October 31, 2024 12:36
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
Signed-off-by: Ahmed Hussein <[email protected]>
@amahussein
Collaborator Author

amahussein commented Oct 31, 2024

Thanks @tgravescs
As we discussed offline:

  • I moved the api_version to the root level
  • I updated the descriptions of the fields
  • I added unit tests to give a clear understanding of valid/invalid files and the expected behavior
  • I made the errors clearer, specifying which fields are failing
  • I put a sample specification file in the test folders for reference. In addition, I added a GitHub link to the valid yaml files in the output of the help cmd

New specifications after the changes for convenience:
https://github.com/NVIDIA/spark-rapids-tools/pull/1395/files#diff-79a9ce702cf001815401a8737b8ec035c1f7ed01d73a5360f2b4f0142807be5b

Collaborator

@cindyyuanjiang cindyyuanjiang left a comment

Thanks @amahussein! Some minor nits and questions.

Signed-off-by: Ahmed Hussein <[email protected]>
Collaborator

@parthosa parthosa left a comment

Thanks @amahussein. This was a nice approach.

Collaborator

@cindyyuanjiang cindyyuanjiang left a comment

Thanks @amahussein for adding this feature!

@amahussein amahussein merged commit ca6cac6 into NVIDIA:dev Nov 4, 2024
14 checks passed
@amahussein amahussein deleted the rapids-tools-1359-part3 branch November 4, 2024 17:34
Labels
feature request New feature or request usability track issues related to the Tools's user experience user_tools Scope the wrapper module running CSP, QualX, and reports (python)
Development

Successfully merging this pull request may close these issues.

[FEA] Allow users to specify custom Dependency jars (spark)
5 participants