Skip to content

fix: ray module not found handling #1049

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

clumsy
Copy link
Contributor

@clumsy clumsy commented Apr 23, 2025

TorchX has been handling ModuleNotFoundError gracefully for a while now, e.g. for SageMaker when running torchx runopts we get:

...
            (remote jobs) the image repository to use when pushing patched images, must have push access. Ex: example.com/your/container
        quiet=QUIET (bool, False)
            whether to suppress verbose output for image building. Defaults to ``False``.

aws_sagemaker: No module named 'sagemaker'

gcp_batch:
    usage:
        [project=PROJECT],[location=LOCATION]
...

But for ray we get an exception after which we won't get next runopts:

gcp_batch:
    usage:
        [project=PROJECT],[location=LOCATION]

    optional arguments:
        project=PROJECT (str, None)
            Name of the GCP project. Defaults to the configured GCP project in the environment
        location=LOCATION (str, us-central1)
            Name of the location to schedule the job in. Defaults to us-central1

Traceback (most recent call last):
  File "/usr/local/bin/torchx", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torchx/cli/main.py", line 118, in main
    run_main(get_sub_cmds(), argv)
  File "/usr/local/lib/python3.10/dist-packages/torchx/cli/main.py", line 114, in run_main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/torchx/cli/cmd_runopts.py", line 36, in run
    opts = runner.scheduler_run_opts(scheduler)
  File "/usr/local/lib/python3.10/dist-packages/torchx/runner/api.py", line 473, in scheduler_run_opts
    return self._scheduler(scheduler).run_opts()
  File "/usr/local/lib/python3.10/dist-packages/torchx/runner/api.py", line 718, in _scheduler
    sched = factory(self._name, **self._scheduler_params)
  File "/usr/local/lib/python3.10/dist-packages/torchx/schedulers/__init__.py", line 39, in run
    module = importlib.import_module(path)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/torchx/schedulers/ray_scheduler.py", line 448, in <module>
    session_name: str, ray_client: Optional[JobSubmissionClient] = None, **kwargs: Any
NameError: name 'JobSubmissionClient' is not defined

That's because ray_scheduler has custom ModuleNotFoundException handling - perhaps for historic reasons.

Test plan:
[x] existing test must pass

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 23, 2025
@clumsy
Copy link
Contributor Author

clumsy commented Apr 23, 2025

Please check this fix if you can, @d4l3k, @kiukchung, @tonykao8080

Copy link
Contributor

@andywag andywag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@facebook-github-bot
Copy link
Contributor

@andywag has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@andywag
Copy link
Contributor

andywag commented Apr 28, 2025

@clumsy Can you add the type to quoted_string quoted_values: List[str] = []?

andywag pushed a commit to andywag/torchx that referenced this pull request Apr 28, 2025
Summary:
TorchX has been handling `ModuleNotFoundError` gracefully for a while now, e.g. for SageMaker when running `torchx runopts` we get:

```
...
            (remote jobs) the image repository to use when pushing patched images, must have push access. Ex: example.com/your/container
        quiet=QUIET (bool, False)
            whether to suppress verbose output for image building. Defaults to ``False``.

aws_sagemaker: No module named 'sagemaker'

gcp_batch:
    usage:
        [project=PROJECT],[location=LOCATION]
...
```

But for `ray` we get an exception after which we won't get next runopts:
```
gcp_batch:
    usage:
        [project=PROJECT],[location=LOCATION]

    optional arguments:
        project=PROJECT (str, None)
            Name of the GCP project. Defaults to the configured GCP project in the environment
        location=LOCATION (str, us-central1)
            Name of the location to schedule the job in. Defaults to us-central1

Traceback (most recent call last):
  File "/usr/local/bin/torchx", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torchx/cli/main.py", line 118, in main
    run_main(get_sub_cmds(), argv)
  File "/usr/local/lib/python3.10/dist-packages/torchx/cli/main.py", line 114, in run_main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/torchx/cli/cmd_runopts.py", line 36, in run
    opts = runner.scheduler_run_opts(scheduler)
  File "/usr/local/lib/python3.10/dist-packages/torchx/runner/api.py", line 473, in scheduler_run_opts
    return self._scheduler(scheduler).run_opts()
  File "/usr/local/lib/python3.10/dist-packages/torchx/runner/api.py", line 718, in _scheduler
    sched = factory(self._name, **self._scheduler_params)
  File "/usr/local/lib/python3.10/dist-packages/torchx/schedulers/__init__.py", line 39, in run
    module = importlib.import_module(path)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/torchx/schedulers/ray_scheduler.py", line 448, in <module>
    session_name: str, ray_client: Optional[JobSubmissionClient] = None, **kwargs: Any
NameError: name 'JobSubmissionClient' is not defined
```

That's because `ray_scheduler` has custom `ModuleNotFoundException` handling - perhaps for historic reasons.


Test Plan: [x] existing test must pass

Differential Revision: D73751531

Pulled By: andywag
@andywag
Copy link
Contributor

andywag commented Apr 28, 2025

@clumsy : You had some formatting issues which I tried to overwrite but it wouldn't let me so I created another PR. You can either move the changes to this PR so we can commit or we can do it from there. No chnages other than formatting.

@clumsy clumsy force-pushed the fix/ray_module_not_found branch from a2d8a47 to e673ee9 Compare April 29, 2025 13:00
@clumsy
Copy link
Contributor Author

clumsy commented Apr 29, 2025

Fixed formatting in this PR, @andywag

Regarding type changes you've asked - did you mean the other PR for quoting env variable values?

@andywag
Copy link
Contributor

andywag commented Apr 30, 2025

@clumsy : I'm landing this diff in #1055 with a formatting fix. Your quote diff has an issue with pyre that I mentioned either her or there. If you fix that, we can land the change.

facebook-github-bot pushed a commit that referenced this pull request May 1, 2025
Differential Revision: D73751531

Pull Request resolved: #1055
@clumsy
Copy link
Contributor Author

clumsy commented May 1, 2025

@andywag sorry I don't follow what's the issue you were observing, I don't see errors in ray_scheduler* when I run ./scripts/pyre.sh
If the fix was merged, should we close this PR then? Thanks!

@andywag
Copy link
Contributor

andywag commented May 1, 2025

Sorry, wrong diff for the pyre issue. I updated it on the correct one. This diff got merged with some formatting changes in 1055

@clumsy
Copy link
Contributor Author

clumsy commented May 1, 2025

I see, so should I still pull-rebase/fix this one on top? @andywag

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants