[FEA] Allow users to specify custom Dependency jars #1395
Conversation
Signed-off-by: Ahmed Hussein <[email protected]>
Fixes NVIDIA#1359
Adds a new input argument, `--tools_config_file`, that takes a path to a YAML file. The config file allows users to define their own binaries that need to be added to the classpath of the tools jar cmd. This change is important because users can use the user-tools wrapper with their custom Spark.
Signed-off-by: Ahmed Hussein <[email protected]>
Thanks @amahussein for this PR. Adding a module for configurations using a pydantic model was interesting. Minor comments.
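For context, a minimal sketch of what such a pydantic-based configuration module might look like. The class and field names (ToolsConfig, ToolsConfigInfo, RuntimeDependency, info, api_version, dependencies, dependency_type, relative_path) are taken from the excerpts quoted later in this thread; the defaults, optionality, and the uri field name are assumptions, not the PR's actual definitions.

```python
# Minimal sketch of a pydantic-based tools configuration model.
# Names follow the excerpts in this review; defaults and optionality are assumptions.
from typing import List, Optional

from pydantic import BaseModel, Field


class RuntimeDependency(BaseModel):
    """A single dependency to be added to the classpath of the tools java cmd."""
    uri: str = Field(description='Location of the artifact (https:// or file:/// URI)')
    dependency_type: Optional[str] = Field(
        default=None, description='One of jar, archive, or classpath (assumed enum values)')
    relative_path: Optional[str] = Field(
        default=None, description='Subdirectory inside an extracted archive that holds the jars')


class ToolsConfigInfo(BaseModel):
    """Metadata about the configuration file itself."""
    api_version: str = Field(description='Version of the configuration schema')


class ToolsConfig(BaseModel):
    """Main container for the user's defined tools configuration."""
    info: ToolsConfigInfo
    dependencies: List[RuntimeDependency] = Field(default_factory=list)
```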
Signed-off-by: Ahmed Hussein <[email protected]>
Thanks @amahussein. LGTM. The configuration module could be very useful in the future for setting runtime properties (e.g., thresholds for TCV, distributed Spark configs, etc.).
Thanks @amahussein ! LGTM.
Thanks @amahussein! Very minor nits.
One general comment about the description: the section "The specification of the file is as follows" doesn't seem to match the real spec, or at least it is very hard to read and match them up. Was this auto-generated? For instance:
The description says "dependency type for the jar command" (what is the jar command here?), but the enum is jar, archive, or classpath. It doesn't seem to match the dependency_type field in the example above it.
I assume some fields are optional, e.g. dependency_type is only listed in your example for one of the 3 dependencies? Also, the description doesn't detail how this interacts with the default Spark downloads; please add those details.
When I run with a tools config file that contains only the info and api version, it fails with:
Is there a way to get a more informative error message?
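Not from the PR itself, but a small illustration of how pydantic validation failures can be surfaced as a shorter, user-facing message. The model here is a stand-in, not the PR's ToolsConfig, and wrapping ValidationError like this is just one possible approach.

```python
# Illustration only: turning pydantic's ValidationError into a concise message.
from pydantic import BaseModel, ValidationError


class ExampleConfig(BaseModel):
    api_version: str
    dependencies: list  # required field that the failing YAML would be missing


def load_config(data: dict) -> ExampleConfig:
    try:
        return ExampleConfig(**data)
    except ValidationError as exc:
        # Summarize each problem as "<field>: <message>" instead of a raw traceback.
        problems = '; '.join(f"{'.'.join(map(str, e['loc']))}: {e['msg']}" for e in exc.errors())
        raise ValueError(f'Invalid tools config file: {problems}') from exc


# A file containing only the info/api_version section would then fail with a
# message like "dependencies: Field required" rather than a long stack trace.
```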
class ToolsConfig(BaseModel):
    """Main container for the user's defined tools configuration"""
    info: ToolsConfigInfo = Field(
info:
  api_version: '1.0'
Is there a reason the api version is under info here vs just being at the top level?
I would have expected it at the top level so that all the fields could be based on this version field (including info).
You are right. The standard is for the api-version to be at the root level.
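For illustration, the two YAML layouts under discussion, nested under info versus at the root. This is a hedged sketch of the shape only, not the final schema.

```python
# Hedged sketch of the two layouts discussed above; field names are illustrative.
import yaml

nested_layout = yaml.safe_load("""
info:
  api_version: '1.0'
""")

root_layout = yaml.safe_load("""
api_version: '1.0'
""")

# With api_version at the root, a loader can pick the right schema
# before validating the rest of the document.
version = root_layout['api_version']
```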
class RuntimeDependency(BaseModel):
    """The runtime dependency required by the tools Jar cmd. All elements are downloaded and added
nit: "Jar" in other places isn't capitalized. I now get what the "jar cmd" is here. its referring calling into the java command from the python side. I wonder if using like "java" instead of jar would be a more obvious to the user.
Good idea! I am replacing "jar" with "Java".
@@ -370,6 +372,23 @@ def process_jvm_args(self) -> None:
        self.p_args['toolArgs']['jobResources'] = adjusted_resources
        self.p_args['toolArgs']['log4jPath'] = Utils.resource_path('dev/log4j.properties')

    def process_tools_config(self) -> None:
        """
        Load the tools config file if it is provided. it creates a ToolsConfig object and sets it
nit "it" should be capitalized. perhaps rename to like load_tools_config or set_tools_config so its not confusing that this is not "processing the actual dependencies"
            validation_alias=AliasChoices('dep_type', 'depType'))
    relative_path: str = Field(
        default=None,
        description='The relative path of the dependency in the classpath. This is relevant for tar files',
"This is relevant for tar files".. does this means I can't use .gz files or just .jar files?
I will clarify the description. This field is used to tell the tools in which subdirectory to find the additional jars. Therefore, it is only relevant for "archive" dependencies like spark.tgz, because extracting the archive produces an entire directory.
If the dependency is a .jar file, then such information is not needed because the jar file is added directly to the classpath.
Thanks @tgravescs. I think the issue here is that the autogenerated file picks up both the field description and the class pydoc, which causes the confusion.
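To illustrate the answer above: a hedged sketch of how relative_path might be used for an archive dependency versus a plain jar. The field names (uri, dependency_type, relative_path) follow this PR's excerpts, but the exact values, including the jars/* subdirectory and file paths, are assumptions.

```python
# Hedged sketch: relative_path only matters for "archive" dependencies, since
# extracting something like spark-3.3.1-bin-hadoop3.tgz yields a directory and
# the tools need to know which subdirectory holds the jars. A plain .jar goes
# straight onto the classpath. Values below are illustrative.
import yaml

dependencies = yaml.safe_load("""
dependencies:
  - uri: file:///local/path/to/spark-3.3.1-bin-hadoop3.tgz
    dependency_type: archive
    relative_path: jars/*   # where the jars live inside the extracted tgz (assumed)
  - uri: file:///local/path/to/extra-udf.jar
    dependency_type: jar    # no relative_path needed for a plain jar
""")
```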
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
Signed-off-by: Ahmed Hussein <[email protected]>
153661e
Thanks @tgravescs
New specifications after the changes for convenience:
Thanks @amahussein! Some minor nits and questions.
Signed-off-by: Ahmed Hussein <[email protected]>
Thanks @amahussein. This was a nice approach.
Thanks @amahussein for adding this feature!
Signed-off-by: Ahmed Hussein [email protected]
Fixes #1359
Add a new input argument that takes a path to a YAML file: --tools_config_file.
The config file allows users to define their own binaries that need to be added to the classpath of the tools jar cmd. This change is important because users can use the user-tools wrapper with their custom Spark.

Sample usage and specifications

The dependencies can be https: or file:/// URIs. Assume that the Spark binaries are for spark-3.3.1-bin-hadoop3, which exists on local disk at /local/path/to/spark-3.3.1-bin-hadoop3.tgz. We can define the config file /home/tools-conf.yaml accordingly and then run the Python command with --tools_config_file pointing at it; a hedged sketch appears after this section. If more jar files are needed on the classpath, the user can define an array of dependencies.
The specification of the file is as follows.
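The exact specification and sample YAML from the original description are not reproduced in this excerpt. The following is a hedged reconstruction based on the fields discussed in this review (api_version, dependencies, uri, dependency_type, relative_path) and on the --tools_config_file argument introduced by the PR; the nesting under a runtime section, the CLI subcommand, and the flags shown in the comment are illustrative assumptions, not the authoritative specification.

```python
# Hedged reconstruction of the sample config from the description.
# Field names come from this review's excerpts; the runtime nesting and the
# example command line are assumptions.
import yaml

tools_conf = yaml.safe_load("""
api_version: '1.0'
runtime:
  dependencies:
    - uri: file:///local/path/to/spark-3.3.1-bin-hadoop3.tgz
      dependency_type: archive
      relative_path: jars/*
""")

# Written to /home/tools-conf.yaml, it would be passed to the wrapper via the
# new argument introduced by this PR, e.g. (illustrative invocation):
#   spark_rapids qualification --tools_config_file /home/tools-conf.yaml \
#       --eventlogs /path/to/eventlogs --platform onprem
```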
Detailed code changes

This pull request includes several changes to the user_tools module, focusing on improving dependency management and configuration handling. The changes include importing new modules, adding methods for handling tool configurations, and updating the way dependencies are processed and verified.

Configuration Handling Enhancements:
- user_tools/src/spark_rapids_pytools/rapids/rapids_tool.py: Added a method get_tools_config_obj to retrieve the tools configuration object from CLI arguments.
- user_tools/src/spark_rapids_pytools/rapids/rapids_tool.py: Added a method populate_dependency_list to check for dependencies in a config file before falling back to default dependencies (a hedged sketch of this flow appears after this list).

Minor Code Improvements:
- user_tools/src/spark_rapids_pytools/rapids/rapids_tool.py: Corrected a comment typo in the _process_output_args method.
- user_tools/src/spark_rapids_pytools/rapids/rapids_tool.py: Adjusted the formatting of the get_rapids_tools_dependencies method for better readability.
- user_tools/src/spark_rapids_pytools/resources/dev/prepackage_mgr.py: Added imports for CspPath and RuntimeDependency to support new dependency handling.

Modified the format of the platform configurations to match the same structure:
- user_tools/src/spark_rapids_pytools/rapids/rapids_tool.py: Added imports for RuntimeDependency, ToolsConfig, and DependencyType to better manage dependencies. Updated methods to use RuntimeDependency objects instead of dictionaries.
- user_tools/src/spark_rapids_pytools/resources/databricks_aws-configs.json: Replaced hashLib with fileHash for verification and added dependencyType for better dependency classification.
- user_tools/src/spark_rapids_pytools/resources/databricks_azure-configs.json: Similar changes to the AWS configs, updating verification and dependency type fields.
- user_tools/src/spark_rapids_pytools/resources/dataproc-configs.json: Updated verification and dependency type fields for Dataproc configurations.
- user_tools/src/spark_rapids_pytools/resources/dataproc_gke-configs.json: Similar updates to verification and dependency type fields for GKE configurations.
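As a rough illustration of the populate_dependency_list behavior described above: prefer the dependencies from the user's tools config when present, otherwise fall back to the platform's default dependency list. The function signature and the way defaults are passed in are stand-ins, not the PR's actual code.

```python
# Rough illustration of the "config first, defaults second" flow described above.
# The signature and defaults lookup are stand-ins for the PR's actual helpers.
from typing import Any, List


def populate_dependency_list(tools_config: Any, default_dependencies: List[Any]) -> List[Any]:
    """Return the user's configured dependencies if any, else the platform defaults."""
    if tools_config is not None and getattr(tools_config, 'dependencies', None):
        # A --tools_config_file with a dependency list was provided: use it as-is.
        return list(tools_config.dependencies)
    # Otherwise fall back to the defaults shipped in the platform *-configs.json resources.
    return default_dependencies
```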