-
Notifications
You must be signed in to change notification settings - Fork 129
fix: ray module not found handling #1049
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Please check this fix if you can, @d4l3k, @kiukchung, @tonykao8080 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
@andywag has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@clumsy Can you add the type to quoted_string quoted_values: List[str] = []? |
Summary: TorchX has been handling `ModuleNotFoundError` gracefully for a while now, e.g. for SageMaker when running `torchx runopts` we get: ``` ... (remote jobs) the image repository to use when pushing patched images, must have push access. Ex: example.com/your/container quiet=QUIET (bool, False) whether to suppress verbose output for image building. Defaults to ``False``. aws_sagemaker: No module named 'sagemaker' gcp_batch: usage: [project=PROJECT],[location=LOCATION] ... ``` But for `ray` we get an exception after which we won't get next runopts: ``` gcp_batch: usage: [project=PROJECT],[location=LOCATION] optional arguments: project=PROJECT (str, None) Name of the GCP project. Defaults to the configured GCP project in the environment location=LOCATION (str, us-central1) Name of the location to schedule the job in. Defaults to us-central1 Traceback (most recent call last): File "/usr/local/bin/torchx", line 8, in <module> sys.exit(main()) File "/usr/local/lib/python3.10/dist-packages/torchx/cli/main.py", line 118, in main run_main(get_sub_cmds(), argv) File "/usr/local/lib/python3.10/dist-packages/torchx/cli/main.py", line 114, in run_main args.func(args) File "/usr/local/lib/python3.10/dist-packages/torchx/cli/cmd_runopts.py", line 36, in run opts = runner.scheduler_run_opts(scheduler) File "/usr/local/lib/python3.10/dist-packages/torchx/runner/api.py", line 473, in scheduler_run_opts return self._scheduler(scheduler).run_opts() File "/usr/local/lib/python3.10/dist-packages/torchx/runner/api.py", line 718, in _scheduler sched = factory(self._name, **self._scheduler_params) File "/usr/local/lib/python3.10/dist-packages/torchx/schedulers/__init__.py", line 39, in run module = importlib.import_module(path) File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module return _bootstrap._gcd_import(name[level:], package, level) File "<frozen importlib._bootstrap>", line 1050, in _gcd_import File "<frozen importlib._bootstrap>", line 1027, in _find_and_load File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked File "<frozen importlib._bootstrap>", line 688, in _load_unlocked File "<frozen importlib._bootstrap_external>", line 883, in exec_module File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed File "/usr/local/lib/python3.10/dist-packages/torchx/schedulers/ray_scheduler.py", line 448, in <module> session_name: str, ray_client: Optional[JobSubmissionClient] = None, **kwargs: Any NameError: name 'JobSubmissionClient' is not defined ``` That's because `ray_scheduler` has custom `ModuleNotFoundException` handling - perhaps for historic reasons. Test Plan: [x] existing test must pass Differential Revision: D73751531 Pulled By: andywag
@clumsy : You had some formatting issues which I tried to overwrite but it wouldn't let me so I created another PR. You can either move the changes to this PR so we can commit or we can do it from there. No chnages other than formatting. |
a2d8a47
to
e673ee9
Compare
Fixed formatting in this PR, @andywag Regarding type changes you've asked - did you mean the other PR for quoting env variable values? |
Differential Revision: D73751531 Pull Request resolved: #1055
@andywag sorry I don't follow what's the issue you were observing, I don't see errors in |
Sorry, wrong diff for the pyre issue. I updated it on the correct one. This diff got merged with some formatting changes in 1055 |
I see, so should I still pull-rebase/fix this one on top? @andywag |
TorchX has been handling
ModuleNotFoundError
gracefully for a while now, e.g. for SageMaker when runningtorchx runopts
we get:But for
ray
we get an exception after which we won't get next runopts:That's because
ray_scheduler
has customModuleNotFoundException
handling - perhaps for historic reasons.Test plan:
[x] existing test must pass