
MLCOMPUTE-2001 | Cleanup spark related logs from setup_tron_namespace #3979

Merged
merged 2 commits into from
Oct 14, 2024

Conversation

chi-yelp
Contributor

Includes the changes from Yelp/service_configuration_lib#151, adds an option to mute some logs when calling functions from setup_tron_namespace, and fixes some issues.

@chi-yelp chi-yelp requested a review from a team as a code owner October 10, 2024 17:16
  default_spark_pool = "batch"
- valid_clusters = ["spark-pnw-prod", "pnw-devc"]
+ valid_clusters = ["pnw-devc-spark", "pnw-prod-spark"]
Contributor Author

To align the default values with the recent puppet changes for pool validation rules
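A minimal sketch of how such a default cluster list might gate validation. The list values come from the diff above; the function name and structure are illustrative assumptions, not paasta's actual validation code:

```python
# Defaults aligned with the puppet pool-validation rules (values from the PR diff).
default_spark_pool = "batch"
valid_clusters = ["pnw-devc-spark", "pnw-prod-spark"]


def validate_cluster(cluster: str) -> bool:
    """Return True only when the cluster uses the new naming scheme."""
    return cluster in valid_clusters


print(validate_cluster("pnw-devc-spark"))  # True: new cluster name
print(validate_cluster("pnw-devc"))        # False: old name is no longer valid
```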

@@ -224,7 +224,7 @@ def main():
# since we need to print out what failed in either case
failed.append(service)

- if args.bulk_config_fetch:
+ if args.dry_run and args.bulk_config_fetch:
Contributor Author

I guess we should also skip Tron API calls here in the dry run mode?

Member

good catch!
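A hedged sketch of the guard change under discussion: only the tightened condition comes from the diff; the surrounding reporting logic and helper names are assumptions made for illustration.

```python
from types import SimpleNamespace


def summarize_updates(args, updated):
    # With the fix, the bulk-config-fetch reporting branch also checks
    # args.dry_run, so Tron API calls are skipped in dry-run mode.
    # (Only the condition is from the PR; the bodies are hypothetical.)
    if args.dry_run and args.bulk_config_fetch:
        return [f"would update {name}" for name in updated]
    return [f"updated {name}" for name in updated]


args = SimpleNamespace(dry_run=True, bulk_config_fetch=True)
print(summarize_updates(args, ["svc_a"]))  # ['would update svc_a']
```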

+ spark_conf[
+     "spark.kubernetes.executor.podTemplateFile"
+ ] = spark_tools.SPARK_DNS_POD_TEMPLATE
Contributor Author

Fix: we should set/overwrite the pod template file (for k8s DNS mode, etc.) whether or not a pod template has already been set.
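The fix above can be sketched as an unconditional assignment. The Spark conf key matches the diff; the template path constant and function name are assumptions standing in for `spark_tools.SPARK_DNS_POD_TEMPLATE`:

```python
# Hypothetical stand-in for spark_tools.SPARK_DNS_POD_TEMPLATE (assumed path).
SPARK_DNS_POD_TEMPLATE = "/etc/spark/dns_pod_template.yaml"


def apply_dns_pod_template(spark_conf: dict) -> dict:
    # Before the fix, a pre-existing value would have been left in place;
    # now the DNS pod template is unconditionally written, so it wins even
    # when the caller already configured a template.
    spark_conf["spark.kubernetes.executor.podTemplateFile"] = SPARK_DNS_POD_TEMPLATE
    return spark_conf


conf = {"spark.kubernetes.executor.podTemplateFile": "user_template.yaml"}
apply_dns_pod_template(conf)
print(conf["spark.kubernetes.executor.podTemplateFile"])  # the DNS template, not user_template.yaml
```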

Comment on lines +179 to +181
def auto_add_timeout_for_spark_job(
cmd: str, timeout_job_runtime: str, silent: bool = False
) -> str:
Member

fwiw: for spark-drivers-on-k8s we should use the tron max_runtime config directly and get rid of this code entirely :)

Contributor Author

Thanks for bringing this up. I think we can move this into our custom spark-submit wrapper later, so it can be more easily monitored from the Spark side, managed by the Spark configuration service and auto-tuner, and kept consistent across the different types of Spark deployments (Tron, ad hoc, Jupyter).
(The current implementation also respects the max_runtime tronfig from the caller side.)
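A hedged sketch of what a function with the signature shown at lines +179 to +181 might do: only the signature is from the PR; the body (prefixing spark-submit commands with coreutils `timeout` unless one is already present) is an assumption for illustration.

```python
def auto_add_timeout_for_spark_job(
    cmd: str, timeout_job_runtime: str, silent: bool = False
) -> str:
    # Sketch: wrap spark-submit commands with `timeout <duration>` so the
    # job is killed if it exceeds the configured runtime. Skips commands
    # that already carry a timeout, respecting the caller's tronfig.
    if "spark-submit" in cmd and not cmd.strip().startswith("timeout"):
        if not silent:
            print(f"Adding timeout {timeout_job_runtime} to spark job")
        return f"timeout {timeout_job_runtime} {cmd}"
    return cmd


print(auto_add_timeout_for_spark_job("spark-submit job.py", "12h", silent=True))
# → timeout 12h spark-submit job.py
```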

@chi-yelp chi-yelp merged commit fd8188b into master Oct 14, 2024
10 checks passed
@chi-yelp chi-yelp deleted the u/chi/reduce_setup_tron_namespace_logs branch October 14, 2024 11:38