[FEA] Allow users to specify custom Dependency jars #1395

Merged
merged 6 commits into NVIDIA:dev from rapids-tools-1359-part3 on Nov 4, 2024

Conversation

amahussein
Collaborator

@amahussein amahussein commented Oct 24, 2024

Signed-off-by: Ahmed Hussein [email protected]

Fixes #1359

Add a new input argument that takes a path to a YAML file: `--tools_config_file`. The config file allows users to define their own binaries that need to be added to the classpath of the tools jar command. This change is important because it lets users run the user-tools wrapper with their own custom Spark.

Sample usage and specifications

  • The dependencies can be `https:` or `file:///` URIs

  • Assume that the Spark binaries are for spark-3.3.1-bin-hadoop3 and exist on local disk at /local/path/to/spark-3.3.1-bin-hadoop3.tgz

  • We can define the config-file /home/tools-conf.yaml as follows:

    info:
      api_version: '1.0'
    runtime:
      dependencies:
        - name: my-spark331
          uri: file:///local/path/to/spark-3.3.1-bin-hadoop3.tgz
          dependency_type:
            dep_type: archive
            # for tgz files, it is required to give the subfolder where the jars are located
            relative_path: jars/*
  • Now, we can run the Python command as:

    spark_rapids profiling \
      --tools_config_file /home/tools-conf.yaml \
      --output_folder $LOCAL_OUTPUT_DIR \
      --eventlogs $LOCAL_GPU_EVENT_LOGS \
      --tools_jar $LOCAL_JAR \
      --verbose
    
  • If more jar files are needed on the classpath, the user can define an array of dependencies (a sketch for computing the verification hash values follows this example):

    info:
      api_version: '1.0'
    runtime:
      dependencies:
        - name: Hadoop AWS
          uri: https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar
          verification:
            size: 962685
        - name: AWS Java SDK Bundled
          uri: https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar
          verification:
            file_hash:
              algorithm: sha1
              value: 02deec3a0ad83d13d032b1812421b23d7a961eea
        - name: spark-3.5.0
          uri: https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
          verification:
            file_hash:
              algorithm: sha512
              value: 8883c67e0a138069e597f3e7d4edbbd5c3a565d50b28644aad02856a1ec1da7cb92b8f80454ca427118f69459ea326eaa073cf7b1a860c3b796f4b07c2101319
          dependency_type:
            dep_type: archive
            relative_path: jars/*
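
The hash values used in the verification blocks above can be computed locally before writing the config file. A minimal sketch (not part of the tools themselves; compute_file_hash is a hypothetical helper and only Python's standard hashlib is assumed):

    import hashlib

    def compute_file_hash(path: str, algorithm: str = 'sha512', chunk_size: int = 1 << 20) -> str:
        # Hex digest of a local file, e.g. the value for verification.file_hash.value
        hasher = hashlib.new(algorithm)
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                hasher.update(chunk)
        return hasher.hexdigest()

    # Example: compute the sha512 value for a locally downloaded Spark archive
    print(compute_file_hash('/local/path/to/spark-3.5.0-bin-hadoop3.tgz', 'sha512'))
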
The specification of the file is as follows:

---
"$defs":
  DependencyType:
    description: Represents the dependency type for the jar cmd
    enum:
    - jar
    - archive
    - classpath
    title: DependencyType
    type: string
  DependencyVerification:
    description: The verification information of the dependency.
    properties:
      size:
        default: 0
        description: The size of the dependency file.
        examples:
        - 3265393
        title: Size
        type: integer
      file_hash:
        "$ref": "#/$defs/FileHashAlgorithm"
        default: 
        description: The hash function to verify the file
        examples:
        - algorithm: md5
          value: bc9bf7fedde0e700b974426fbd8d869c
    title: DependencyVerification
    type: object
  FileHashAlgorithm:
    description: |-
      Represents a file hash algorithm and its value. Used for verification against an
      existing file.
      ```py
      try:
          file_algo = FileHashAlgorithm(algorithm=HashAlgorithm.SHA256, value='...')
          file_algo.verify_file(CspPath('file://path/to/file'))
      except ValidationError as e:
          print(e)
      ```
    properties:
      algorithm:
        "$ref": "#/$defs/HashAlgorithm"
      value:
        title: Value
        type: string
    required:
    - algorithm
    - value
    title: FileHashAlgorithm
    type: object
  HashAlgorithm:
    description: Represents the supported hashing algorithms
    enum:
    - md5
    - sha1
    - sha256
    - sha512
    title: HashAlgorithm
    type: string
  RuntimeDependency:
    description: |-
      The runtime dependency required by the tools Jar cmd. All elements are downloaded and added
      to the classPath.
    properties:
      name:
        description: The name of the dependency
        title: Name
        type: string
      uri:
        anyOf:
        - format: uri
          minLength: 1
          type: string
        - format: file-path
          type: string
        description: The FileURI of the dependency. It can be a local file or a remote
          file
        examples:
        - file:///path/to/file.jar
        - https://mvn-url/24.08.1/rapids-4-spark-tools_2.12-24.08.1.jar
        - gs://bucket-name/path/to/file.jar
        title: Uri
      dependency_type:
        "$ref": "#/$defs/RuntimeDependencyType"
        description: The type of the dependency and how to find the lib files after
          decompression.
      verification:
        "$ref": "#/$defs/DependencyVerification"
        default: 
        examples:
        - size: 3265393
        - fileHash:
            algorithm: md5
            value: bc9bf7fedde0e700b974426fbd8d869c
    required:
    - name
    - uri
    title: RuntimeDependency
    type: object
  RuntimeDependencyType:
    description: |-
      Defines the type of dependency. It can be one of the following:
         - Archived file (.tgz)
         - Simple JAR file (*.jar)
         - Classpath directory (not yet supported)

      Note: The 'classpath' type is reserved for future use, allowing users to point to a directory
      in the classpath without needing to download or copy any binaries.
    properties:
      dep_type:
        "$ref": "#/$defs/DependencyType"
        description: The type of the dependency
      relative_path:
        default: 
        description: The relative path of the dependency in the classpath. This is
          relevant for tar files
        examples:
        - jars/*
        title: Relative Path
        type: string
    required:
    - dep_type
    title: RuntimeDependencyType
    type: object
  ToolsConfigInfo:
    description: |-
      High level metadata about the tools configurations (i.e., the api-version and any other relevant
      metadata).
    properties:
      api_version:
        description: The version of the API that the tools are using. This is used
          to test the compatibility of the configuration file against the current
          tools release
        examples:
        - '1.0'
        maximum: 1
        minimum: 1
        title: Api Version
        type: number
    required:
    - api_version
    title: ToolsConfigInfo
    type: object
  ToolsRuntimeConfig:
    description: The runtime configurations of the tools as defined by the user.
    properties:
      dependencies:
        description: The list of runtime dependencies required by the tools Jar cmd.
          All elements are downloaded and added to the classPath
        items:
          "$ref": "#/$defs/RuntimeDependency"
        title: Dependencies
        type: array
    required:
    - dependencies
    title: ToolsRuntimeConfig
    type: object
description: Main container for the user's defined tools configuration
properties:
  info:
    "$ref": "#/$defs/ToolsConfigInfo"
    description: Metadata information of the tools configuration
  runtime:
    "$ref": "#/$defs/ToolsRuntimeConfig"
    description: Configuration related to the runtime environment of the tools
required:
- info
- runtime
title: ToolsConfig
type: object
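
For orientation, the schema above is generated from pydantic models. A minimal sketch of how such a config file could be loaded and validated (the models here simply mirror the schema and are illustrative, not copied from the PR's implementation):

    from typing import List, Optional

    import yaml  # assumes PyYAML is available
    from pydantic import BaseModel, Field, ValidationError

    # Illustrative models mirroring the schema above (not the PR's actual classes)
    class ToolsConfigInfo(BaseModel):
        api_version: float = Field(ge=1.0, le=1.0)

    class RuntimeDependency(BaseModel):
        name: str
        uri: str
        dependency_type: Optional[dict] = None
        verification: Optional[dict] = None

    class ToolsRuntimeConfig(BaseModel):
        dependencies: List[RuntimeDependency]

    class ToolsConfig(BaseModel):
        info: ToolsConfigInfo
        runtime: ToolsRuntimeConfig

    def load_tools_config(path: str) -> ToolsConfig:
        # Parse the YAML file and validate it against the models above
        with open(path, encoding='utf-8') as f:
            return ToolsConfig.model_validate(yaml.safe_load(f))

    try:
        cfg = load_tools_config('/home/tools-conf.yaml')
        print([dep.name for dep in cfg.runtime.dependencies])
    except ValidationError as err:
        print(err)
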

Detailed code changes

This pull request includes several changes to the user_tools module, focusing on improving dependency management and configuration handling. The changes include importing new modules, adding methods for handling tool configurations, and updating the way dependencies are processed and verified.
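
A rough sketch of what such a verification step could look like (verify_dependency is hypothetical; only the size and file_hash fields from the schema above are assumed):

    import hashlib
    import os
    from typing import Optional

    def verify_dependency(path: str, size: int = 0,
                          algorithm: Optional[str] = None,
                          value: Optional[str] = None) -> bool:
        # Check a downloaded dependency against the optional size and file_hash
        # fields from the config file; skip whichever checks are not provided.
        if size and os.path.getsize(path) != size:
            return False
        if algorithm and value:
            hasher = hashlib.new(algorithm)
            with open(path, 'rb') as f:
                for chunk in iter(lambda: f.read(1 << 20), b''):
                    hasher.update(chunk)
            if hasher.hexdigest() != value:
                return False
        return True
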

Configuration Handling Enhancements:

Minor Code Improvements:

Modified the format of platform configuration to match the same structure

Signed-off-by: Ahmed Hussein <[email protected]>

Fixes NVIDIA#1359

Add a new input argument that takes a path to a yaml file `--tools_config_file`
The config file allows the users to define their own binaries that need
to be added to the classpath of the tools jar cmd.
This change is important because users can use the user-tools wrapper
with their custom spark.
@amahussein amahussein added feature request New feature or request user_tools Scope the wrapper module running CSP, QualX, and reports (python) usability track issues related to the Tools's user experience labels Oct 24, 2024
@amahussein amahussein self-assigned this Oct 24, 2024
Signed-off-by: Ahmed Hussein <[email protected]>
@amahussein amahussein changed the title Allow users to specify custom Dependency jars [FEA] Allow users to specify custom Dependency jars Oct 24, 2024
Collaborator

@parthosa parthosa left a comment

Thanks @amahussein for this PR. Adding a module for configurations using a pydantic model was interesting. Minor comments.

Signed-off-by: Ahmed Hussein <[email protected]>
parthosa
parthosa previously approved these changes Oct 25, 2024
Collaborator

@parthosa parthosa left a comment

Thanks @amahussein. LGTM. The configuration module could be very useful in the future for setting runtime properties (e.g., thresholds for TCV, distributed Spark configs, etc.).

nartal1
nartal1 previously approved these changes Oct 25, 2024
Collaborator

@nartal1 nartal1 left a comment

Thanks @amahussein! LGTM.

cindyyuanjiang
cindyyuanjiang previously approved these changes Oct 27, 2024
Collaborator

@cindyyuanjiang cindyyuanjiang left a comment

Thanks @amahussein! Very minor nits.

@tgravescs
Collaborator

tgravescs commented Oct 28, 2024

One general comment about the description: the section "The specification of the file is as follows" doesn't seem to match what the real spec is. Or at least it's very hard to read and match them up. Was this auto-generated?

For instance:

DependencyType:
    description: Represents the dependency type for the jar cmd
    enum:
    - jar
    - archive
    - classpath
    title: DependencyType
    type: string

The description is "dependency type for the jar command" (what is the jar command here?), but the enum is jar, archive, or classpath. It doesn't seem to match the dependency_type field in the example above it.

dependency_type:
  dep_type: archive
  relative_path: jars/*

I assume some fields are optional, e.g., dependency_type is only listed in your example for one of the 3 dependencies?

Also, the description doesn't detail how this interacts with the default Spark downloads; please add those details.

@tgravescs
Collaborator

When I run with a tools config file that contains only the info and api version, it fails with:

Traceback (most recent call last):
....
TypeError: argument 'message_template': 'ValidationError' object cannot be converted to 'PyString'

Is there a way to get a more informative error message?
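
One possible way to surface a clearer message (a sketch only; it relies on pydantic's public error API rather than the project's actual logging path, and describe_validation_error is a hypothetical helper) is to flatten the ValidationError into field-level strings before it reaches any message template:

    from pydantic import ValidationError

    def describe_validation_error(err: ValidationError) -> str:
        # Each entry names the failing field path and the reason, as plain strings
        lines = ['{}: {}'.format('.'.join(str(p) for p in e['loc']), e['msg'])
                 for e in err.errors()]
        return 'Invalid tools config file:\n  ' + '\n  '.join(lines)
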


class ToolsConfig(BaseModel):
    """Main container for the user's defined tools configuration"""
    info: ToolsConfigInfo = Field(
Collaborator

info:
  api_version: '1.0'

Is there a reason the api version is under info here vs just being at the top level?
I would have expected it at the top level so that all the fields could be based on this version field (including info).

Collaborator Author

You are right. The standard is for the api-version to be at the root level.
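
For illustration only, a root-level layout along these lines might look roughly as follows (the exact field set after the change is defined by the updated spec linked later in this thread, not by this sketch):

    api_version: '1.0'
    runtime:
      dependencies:
        - name: my-spark331
          uri: file:///local/path/to/spark-3.3.1-bin-hadoop3.tgz
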



class RuntimeDependency(BaseModel):
    """The runtime dependency required by the tools Jar cmd. All elements are downloaded and added
Collaborator

nit: "Jar" in other places isn't capitalized. I now get what the "jar cmd" is here; it's referring to calling into the java command from the python side. I wonder if using "java" instead of "jar" would be more obvious to the user.

Collaborator Author

Good idea! I am replacing "jar" with "Java".

@@ -370,6 +372,23 @@ def process_jvm_args(self) -> None:
        self.p_args['toolArgs']['jobResources'] = adjusted_resources
        self.p_args['toolArgs']['log4jPath'] = Utils.resource_path('dev/log4j.properties')

    def process_tools_config(self) -> None:
        """
        Load the tools config file if it is provided. it creates a ToolsConfig object and sets it
Collaborator

nit: "it" should be capitalized. Perhaps rename to something like load_tools_config or set_tools_config so it's not confusing that this is not "processing the actual dependencies".

        validation_alias=AliasChoices('dep_type', 'depType'))
    relative_path: str = Field(
        default=None,
        description='The relative path of the dependency in the classpath. This is relevant for tar files',
Collaborator

"This is relevant for tar files": does this mean I can't use .gz files, or just .jar files?

Collaborator Author

I will clarify the description. This field tells the tools in which subdirectory to find the additional jars. Therefore, it is only relevant for an "archive" dependency like spark.tgz, because extracting it produces an entire directory.
If the dependency is a .jar file, such information is not needed because the jar file is added to the classpath directly.
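
To illustrate the distinction (a sketch reusing dependencies from the PR description above, not an exhaustive spec):

    dependencies:
      # plain jar: added to the classpath as-is, no dependency_type or relative_path needed
      - name: Hadoop AWS
        uri: https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar
      # archive: extracted first, so relative_path points at the jars inside the unpacked directory
      - name: spark-3.5.0
        uri: https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
        dependency_type:
          dep_type: archive
          relative_path: jars/*
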

@amahussein
Collaborator Author

One general comment about the description: the section "The specification of the file is as follows" doesn't seem to match what the real spec is. Or at least it's very hard to read and match them up. Was this auto-generated?

For instance:

DependencyType:
    description: Represents the dependency type for the jar cmd
    enum:
    - jar
    - archive
    - classpath
    title: DependencyType
    type: string

The description is "dependency type for the jar command" (what is the jar command here?), but the enum is jar, archive, or classpath. It doesn't seem to match the dependency_type field in the example above it.

dependency_type:
  dep_type: archive
  relative_path: jars/*

I assume some fields are optional, e.g., dependency_type is only listed in your example for one of the 3 dependencies?

Also, the description doesn't detail how this interacts with the default Spark downloads; please add those details.

Thanks @tgravescs

I think the issue here is that the autogenerated file picks up both the field description and the class pydoc, which makes it confusing.
There is some control over the autogenerated file by excluding fields, adding examples, etc. I will revisit it to produce a better explanation of the schema.
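
As a rough illustration of that kind of control (a sketch assuming pydantic v2; the class and fields shown are illustrative, not the PR's actual code), the generated schema picks up both the class docstring and the per-field descriptions and examples:

    from pydantic import BaseModel, Field

    class RuntimeDependency(BaseModel):
        """The runtime dependency required by the tools java cmd."""
        name: str = Field(description='The name of the dependency')
        uri: str = Field(
            description='The FileURI of the dependency. It can be a local file or a remote file',
            examples=['file:///path/to/file.jar'])

    # Both the docstring and the Field descriptions/examples end up in the schema output,
    # which is why the autogenerated spec can read oddly when the two overlap.
    print(RuntimeDependency.model_json_schema())
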

amahussein and others added 2 commits October 31, 2024 12:36
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
Signed-off-by: Ahmed Hussein <[email protected]>
@amahussein
Collaborator Author

amahussein commented Oct 31, 2024

Thanks @tgravescs
As we discussed offline:

  • I moved the api_version to the root level
  • I updated the descriptions of the fields
  • I added unit tests to give a clear understanding of valid/invalid files and the expected behavior
  • I made the errors clearer, specifying which fields are failing
  • I put a sample specification file in the test folders for reference. In addition, I added a GitHub link to the valid yaml files in the output of the help cmd

New specifications after the changes for convenience:
https://github.com/NVIDIA/spark-rapids-tools/pull/1395/files#diff-79a9ce702cf001815401a8737b8ec035c1f7ed01d73a5360f2b4f0142807be5b

Collaborator

@cindyyuanjiang cindyyuanjiang left a comment

Thanks @amahussein! Some minor nits and questions.

Signed-off-by: Ahmed Hussein <[email protected]>
Collaborator

@parthosa parthosa left a comment

Thanks @amahussein. This was a nice approach.

Collaborator

@cindyyuanjiang cindyyuanjiang left a comment

Thanks @amahussein for adding this feature!

@amahussein amahussein merged commit ca6cac6 into NVIDIA:dev Nov 4, 2024
14 checks passed
@amahussein amahussein deleted the rapids-tools-1359-part3 branch November 4, 2024 17:34
Labels
feature request New feature or request usability track issues related to the Tools's user experience user_tools Scope the wrapper module running CSP, QualX, and reports (python)
Development

Successfully merging this pull request may close these issues.

[FEA] Allow users to specify custom Dependency jars (spark)
5 participants