[Bug]: debugging guide for device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp" #6056

Closed
youkaichao opened this issue Jul 2, 2024 · 1 comment · Fixed by #6092
Labels
bug Something isn't working


@youkaichao
Member

Your current environment

The output of `python collect_env.py`

🐛 Describe the bug

This is a compound and annoying bug, tied to the PyTorch bug pytorch/pytorch#122815.

Basically, PyTorch's torch.cuda.device_count function caches the device count the first time it is called. Users might not call it directly, but import torch._dynamo calls it. The call chain is:

  File "/usr/local/lib/python3.10/dist-packages/torchvision/ops/roi_align.py", line 4, in <module>
    import torch._dynamo
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/__init__.py", line 2, in <module>
    from . import convert_frame, eval_frame, resume_execution
  File "<frozen importlib._bootstrap>", line 1078, in _handle_fromlist
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 40, in <module>
    from . import config, exc, trace_rules
  File "<frozen importlib._bootstrap>", line 1078, in _handle_fromlist
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/trace_rules.py", line 50, in <module>
    from .variables import (
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/variables/__init__.py", line 4, in <module>
    from .builtin import BuiltinVariable
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/variables/builtin.py", line 42, in <module>
    from .ctx_manager import EventVariable, StreamVariable
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/variables/ctx_manager.py", line 12, in <module>
    from ..device_interface import get_interface_for_device
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/device_interface.py", line 198, in <module>
    for i in range(torch.cuda.device_count()):

In our case, some image processing code imports torchvision, which implicitly imports torch._dynamo:

  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 31, in <module>
    from vllm.multimodal.utils import (async_get_and_parse_image,
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 992, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/vllm/multimodal/__init__.py", line 2, in <module>
    from .registry import MultiModalRegistry
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/vllm/multimodal/registry.py", line 10, in <module>
    from .image import (ImageFeatureData, ImageFeaturePlugin, ImagePixelData,
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/vllm/multimodal/image.py", line 10, in <module>
    from vllm.transformers_utils.image_processor import get_image_processor
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/vllm/transformers_utils/image_processor.py", line 3, in <module>
    from transformers import AutoImageProcessor
  File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/import_utils.py", line 1551, in __getattr__
    value = getattr(module, name)
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/import_utils.py", line 1550, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/import_utils.py", line 1560, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/image_processing_auto.py", line 27, in <module>
    from ...image_processing_utils import BaseImageProcessor, ImageProcessingMixin
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/transformers/image_processing_utils.py", line 21, in <module>
    from .image_transforms import center_crop, normalize, rescale
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/transformers/image_transforms.py", line 22, in <module>
    from .image_utils import (
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/transformers/image_utils.py", line 58, in <module>
    from torchvision.transforms import InterpolationMode
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 992, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/torchvision/__init__.py", line 6, in <module>
    from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils
  File "<frozen importlib._bootstrap>", line 1078, in _handle_fromlist
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/torchvision/models/__init__.py", line 2, in <module>
    from .convnext import *
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/torchvision/models/convnext.py", line 8, in <module>
    from ..ops.misc import Conv2dNormActivation, Permute
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 992, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/torchvision/ops/__init__.py", line 23, in <module>
    from .poolers import MultiScaleRoIAlign
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/torchvision/ops/poolers.py", line 10, in <module>
    from .roi_align import roi_align
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/torchvision/ops/roi_align.py", line 4, in <module>
    import torch._dynamo
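To check whether a given import drags in torch._dynamo without reading long traces like the one above, one option is to import the module in a fresh interpreter and inspect `sys.modules`. This is a hypothetical helper (the name `import_pulls_in` and its exit-code convention are made up for illustration); a subprocess is used because modules already loaded in the current process would mask the result.

```python
import subprocess
import sys

# Exit codes used by the child interpreter (hypothetical convention).
_PULLED_IN, _NOT_PULLED_IN, _IMPORT_FAILED = 0, 3, 2


def import_pulls_in(module: str, probe: str) -> bool:
    """Return True if importing `module` in a fresh interpreter also imports `probe`."""
    code = (
        "import importlib, sys\n"
        "try:\n"
        f"    importlib.import_module({module!r})\n"
        "except ImportError:\n"
        f"    sys.exit({_IMPORT_FAILED})\n"
        f"sys.exit({_PULLED_IN} if {probe!r} in sys.modules else {_NOT_PULLED_IN})\n"
    )
    result = subprocess.run([sys.executable, "-c", code])
    if result.returncode == _IMPORT_FAILED:
        raise ImportError(f"could not import {module!r}")
    if result.returncode not in (_PULLED_IN, _NOT_PULLED_IN):
        # Any other failure (e.g. an exception raised during the import itself).
        raise RuntimeError(f"importing {module!r} failed (exit code {result.returncode})")
    return result.returncode == _PULLED_IN
```

For example, `import_pulls_in("torchvision", "torch._dynamo")` checks the chain shown in the trace above; the answer depends on the installed torchvision and PyTorch versions.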

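Since torch._dynamo remembers the device count, a hook to initialize all devices runs after CUDA is initialized. If we shrink CUDA_VISIBLE_DEVICES after that import but before CUDA is initialized, the remembered count is larger than the number of devices actually visible, so the hook asks for a device index that no longer exists and we hit this error.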
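The failure sequence can be simulated in plain Python to show why a stale count trips the assertion. Every name here (`device_count`, `get_device_properties`, `check_all_devices`, `lazy_init`, the 8-GPU default) is a hypothetical stand-in, not PyTorch's implementation; only the assertion message is taken from this issue's title.

```python
import functools
import os

_queued_calls = []  # hooks to run once, when "CUDA" is lazily initialized


def _visible_device_count() -> int:
    """Hypothetical runtime query: GPUs listed in CUDA_VISIBLE_DEVICES (8 if unset)."""
    value = os.environ.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return 8
    return len([d for d in value.split(",") if d.strip()])


@functools.lru_cache(maxsize=1)
def device_count() -> int:
    # Cached on first call, like the count described in this issue.
    return _visible_device_count()


def get_device_properties(device: int) -> str:
    # Mirrors the C++ check `device >= 0 && device < num_gpus`, where num_gpus
    # is what the runtime sees at initialization, not the cached value.
    num_gpus = _visible_device_count()
    if not (0 <= device < num_gpus):
        raise RuntimeError("device >= 0 && device < num_gpus INTERNAL ASSERT FAILED")
    return f"properties of device {device}"


def check_all_devices() -> None:
    # Hook that touches every device, trusting the (possibly stale) cached count.
    for d in range(device_count()):
        get_device_properties(d)


def lazy_init() -> None:
    # Run and drain the queued hooks, in order.
    while _queued_calls:
        _queued_calls.pop(0)()


def demo() -> None:
    # 1. An early import caches the count while 4 GPUs are visible and a hook is queued.
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
    device_count()
    _queued_calls.append(check_all_devices)
    # 2. CUDA_VISIBLE_DEVICES is shrunk before initialization.
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"
    # 3. Initialization runs the hook over range(4), but only device 0 exists.
    try:
        lazy_init()
    except RuntimeError as exc:
        print("failed:", exc)


if __name__ == "__main__":
    demo()
```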

PyTorch fixed this bug in pytorch/pytorch#122795.

However, until we upgrade to PyTorch 2.4, we cannot fix the root cause.

Inside vLLM, we already use vllm.utils.cuda_device_count_stateless wherever possible. (Any call to torch.cuda.device_count() inside vLLM is a bug and should be replaced with vllm.utils.cuda_device_count_stateless().)
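The implementation of vllm.utils.cuda_device_count_stateless is not shown in this issue. One way to get a count that cannot go stale, sketched here as an assumption rather than vLLM's actual code, is to key the cache on the current value of CUDA_VISIBLE_DEVICES, so a changed value triggers a fresh query. `device_count_stateless` and `_probe_device_count` are hypothetical names, and the probe only parses the environment variable instead of asking the driver.

```python
import functools
import os
from typing import Optional


def _probe_device_count() -> int:
    """Hypothetical uncached query: GPUs listed in CUDA_VISIBLE_DEVICES (8 if unset)."""
    value = os.environ.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return 8
    return len([d for d in value.split(",") if d.strip()])


@functools.lru_cache(maxsize=8)
def _device_count_for(cuda_visible_devices: Optional[str]) -> int:
    # The argument is not used in the body; it only acts as the cache key,
    # so each distinct CUDA_VISIBLE_DEVICES value gets its own cache entry.
    return _probe_device_count()


def device_count_stateless() -> int:
    # Look up (or compute) the count for the *current* environment.
    return _device_count_for(os.environ.get("CUDA_VISIBLE_DEVICES"))


if __name__ == "__main__":
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
    print(device_count_stateless())  # 4
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"
    print(device_count_stateless())  # 1: new key, so a fresh query
```

The design keeps the benefit of caching (repeated calls with an unchanged environment do not re-query) while staying correct when the environment variable changes.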

If another library (e.g., transformers in this case) incidentally calls torch.cuda.device_count() at import time, the only thing we can do is defer the import, as done in #6055.
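#6055 itself is not reproduced here. A minimal sketch of the deferred-import pattern, assuming a helper shaped like the get_image_processor in vllm/transformers_utils/image_processor.py from the trace above (the signature shown is an assumption): the module-level `from transformers import AutoImageProcessor` moves into the function body, so it only runs when an image processor is actually requested.

```python
import sys


def get_image_processor(processor_name: str, *args, trust_remote_code: bool = False, **kwargs):
    """Load a Hugging Face image processor, importing transformers only on demand.

    Because the import lives inside the function body, importing the module
    that defines this helper does not import transformers, and so does not
    pull in torchvision and torch._dynamo as in the trace above.
    """
    from transformers import AutoImageProcessor  # deferred import

    return AutoImageProcessor.from_pretrained(
        processor_name, *args, trust_remote_code=trust_remote_code, **kwargs)


if __name__ == "__main__":
    # Defining the helper did not import transformers.
    print("transformers" in sys.modules)  # False
```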

How do we find the code to blame? My current approach is to manually insert import traceback; traceback.print_stack() inside torch.cuda.device_count. Yes, that means modifying PyTorch's own code. If it prints a stack trace before we initialize the engine, that stack points to the line to blame.
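As a variant of the approach above that avoids editing PyTorch's installed files, the same stack print can be injected at runtime by wrapping the function. This is a hypothetical debugging helper, not part of vLLM or PyTorch. It has to run before the suspect imports (e.g., at the top of the entry script). Because setattr rebinds the module attribute, code inside torch.cuda that looks up device_count as a global also goes through the wrapper, but any caller that already did `from torch.cuda import device_count` keeps the unwrapped original.

```python
import functools
import io
import traceback
import types


def print_stack_on_call(module, name, file=None):
    """Replace module.<name> with a wrapper that prints the caller's stack first.

    Returns the original function so it can be restored afterwards.
    """
    original = getattr(module, name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        traceback.print_stack(file=file)  # file=None prints to stderr
        return original(*args, **kwargs)

    setattr(module, name, wrapper)
    return original


if __name__ == "__main__":
    # Demo on a stand-in module; with PyTorch installed, the analogous call
    # would be print_stack_on_call(torch.cuda, "device_count").
    fake_cuda = types.ModuleType("fake_cuda")

    def device_count() -> int:
        return 2

    fake_cuda.device_count = device_count
    buffer = io.StringIO()
    print_stack_on_call(fake_cuda, "device_count", file=buffer)
    print(fake_cuda.device_count())  # 2
    print(buffer.getvalue())  # stack of the call above
```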

Once all such call sites have been deferred, this bug should be fixed.

@DarkLight1337
Member

DarkLight1337 commented Jul 2, 2024

Hmm, I'm getting this error for #5276:
https://buildkite.com/vllm/ci-aws/builds/3678#0190710a-4a20-4bb1-8f85-8efe1a7615a1

The stack trace suggests that import torch inside conftest.py is to blame, but I'm pretty sure that import has been there from the beginning, so it can't be the cause.

When I try to log the traceback in torch.cuda.device_count(), I get this:

  File "/home/cyrusleung/miniconda3/envs/vllm/bin/pytest", line 8, in <module>
    sys.exit(console_main())
  File "/home/cyrusleung/miniconda3/envs/vllm/lib/python3.10/site-packages/_pytest/config/__init__.py", line 197, in console_main
    code = main()
  File "/home/cyrusleung/miniconda3/envs/vllm/lib/python3.10/site-packages/_pytest/config/__init__.py", line 174, in main
    ret: Union[ExitCode, int] = config.hook.pytest_cmdline_main(
  File "/home/cyrusleung/miniconda3/envs/vllm/lib/python3.10/site-packages/pluggy/_hooks.py", line 501, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
  File "/home/cyrusleung/miniconda3/envs/vllm/lib/python3.10/site-packages/pluggy/_manager.py", line 119, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/home/cyrusleung/miniconda3/envs/vllm/lib/python3.10/site-packages/pluggy/_callers.py", line 102, in _multicall
    res = hook_impl.function(*args)
  File "/home/cyrusleung/miniconda3/envs/vllm/lib/python3.10/site-packages/_pytest/main.py", line 332, in pytest_cmdline_main
    return wrap_session(config, _main)
  File "/home/cyrusleung/miniconda3/envs/vllm/lib/python3.10/site-packages/_pytest/main.py", line 285, in wrap_session
    session.exitstatus = doit(config, session) or 0
  File "/home/cyrusleung/miniconda3/envs/vllm/lib/python3.10/site-packages/_pytest/main.py", line 339, in _main
    config.hook.pytest_runtestloop(session=session)
  File "/home/cyrusleung/miniconda3/envs/vllm/lib/python3.10/site-packages/pluggy/_hooks.py", line 501, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
  File "/home/cyrusleung/miniconda3/envs/vllm/lib/python3.10/site-packages/pluggy/_manager.py", line 119, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/home/cyrusleung/miniconda3/envs/vllm/lib/python3.10/site-packages/pluggy/_callers.py", line 102, in _multicall
    res = hook_impl.function(*args)
  File "/home/cyrusleung/miniconda3/envs/vllm/lib/python3.10/site-packages/_pytest/main.py", line 364, in pytest_runtestloop
    item.config.hook.pytest_runtest_protocol(item=item, nextitem=nextitem)
  File "/home/cyrusleung/miniconda3/envs/vllm/lib/python3.10/site-packages/pluggy/_hooks.py", line 501, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
  File "/home/cyrusleung/miniconda3/envs/vllm/lib/python3.10/site-packages/pluggy/_manager.py", line 119, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/home/cyrusleung/miniconda3/envs/vllm/lib/python3.10/site-packages/pluggy/_callers.py", line 102, in _multicall
    res = hook_impl.function(*args)
  File "/home/cyrusleung/miniconda3/envs/vllm/lib/python3.10/site-packages/_pytest/runner.py", line 115, in pytest_runtest_protocol
    runtestprotocol(item, nextitem=nextitem)
  File "/home/cyrusleung/miniconda3/envs/vllm/lib/python3.10/site-packages/_pytest/runner.py", line 134, in runtestprotocol
    reports.append(call_and_report(item, "call", log))
  File "/home/cyrusleung/miniconda3/envs/vllm/lib/python3.10/site-packages/_pytest/runner.py", line 239, in call_and_report
    call = CallInfo.from_call(
  File "/home/cyrusleung/miniconda3/envs/vllm/lib/python3.10/site-packages/_pytest/runner.py", line 340, in from_call
    result: Optional[TResult] = func()
  File "/home/cyrusleung/miniconda3/envs/vllm/lib/python3.10/site-packages/_pytest/runner.py", line 240, in <lambda>
    lambda: runtest_hook(item=item, **kwds), when=when, reraise=reraise
  File "/home/cyrusleung/miniconda3/envs/vllm/lib/python3.10/site-packages/pluggy/_hooks.py", line 501, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
  File "/home/cyrusleung/miniconda3/envs/vllm/lib/python3.10/site-packages/pluggy/_manager.py", line 119, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/home/cyrusleung/miniconda3/envs/vllm/lib/python3.10/site-packages/pluggy/_callers.py", line 102, in _multicall
    res = hook_impl.function(*args)
  File "/home/cyrusleung/miniconda3/envs/vllm/lib/python3.10/site-packages/_pytest/runner.py", line 172, in pytest_runtest_call
    item.runtest()
  File "/home/cyrusleung/miniconda3/envs/vllm/lib/python3.10/site-packages/_pytest/python.py", line 1772, in runtest
    self.ihook.pytest_pyfunc_call(pyfuncitem=self)
  File "/home/cyrusleung/miniconda3/envs/vllm/lib/python3.10/site-packages/pluggy/_hooks.py", line 501, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
  File "/home/cyrusleung/miniconda3/envs/vllm/lib/python3.10/site-packages/pluggy/_manager.py", line 119, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/home/cyrusleung/miniconda3/envs/vllm/lib/python3.10/site-packages/pluggy/_callers.py", line 102, in _multicall
    res = hook_impl.function(*args)
  File "/home/cyrusleung/miniconda3/envs/vllm/lib/python3.10/site-packages/_pytest/python.py", line 195, in pytest_pyfunc_call
    result = testfunction(**testargs)
  File "/home/cyrusleung/vllm-rocm/tests/distributed/test_multimodal_broadcast.py", line 43, in test_models
    run_test(
  File "/home/cyrusleung/vllm-rocm/tests/models/test_llava.py", line 113, in run_test
    with vllm_runner(model_id,
  File "/home/cyrusleung/vllm-rocm/tests/conftest.py", line 439, in __init__
    self.model = LLM(
  File "/home/cyrusleung/vllm-rocm/vllm/entrypoints/llm.py", line 144, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/home/cyrusleung/vllm-rocm/vllm/engine/llm_engine.py", line 405, in from_engine_args
    engine = cls(
  File "/home/cyrusleung/vllm-rocm/vllm/engine/llm_engine.py", line 238, in __init__
    self.model_executor = executor_class(
  File "/home/cyrusleung/vllm-rocm/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
    super().__init__(*args, **kwargs)
  File "/home/cyrusleung/vllm-rocm/vllm/executor/executor_base.py", line 41, in __init__
    self._init_executor()
  File "/home/cyrusleung/vllm-rocm/vllm/executor/multiproc_gpu_executor.py", line 68, in _init_executor
    self.driver_worker = self._create_worker(
  File "/home/cyrusleung/vllm-rocm/vllm/executor/gpu_executor.py", line 67, in _create_worker
    wrapper.init_worker(**self._get_worker_kwargs(local_rank, rank,
  File "/home/cyrusleung/vllm-rocm/vllm/worker/worker_base.py", line 311, in init_worker
    self.worker = worker_class(*args, **kwargs)
  File "/home/cyrusleung/vllm-rocm/vllm/worker/worker.py", line 87, in __init__
    self.model_runner: GPUModelRunnerBase = ModelRunnerClass(
  File "/home/cyrusleung/vllm-rocm/vllm/worker/model_runner.py", line 196, in __init__
    self.attn_backend = get_attn_backend(
  File "/home/cyrusleung/vllm-rocm/vllm/attention/selector.py", line 45, in get_attn_backend
    backend = which_attn_to_use(num_heads, head_size, num_kv_heads,
  File "/home/cyrusleung/vllm-rocm/vllm/attention/selector.py", line 151, in which_attn_to_use
    if torch.cuda.get_device_capability()[0] < 8:
  File "/home/cyrusleung/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/cuda/__init__.py", line 430, in get_device_capability
    prop = get_device_properties(device)
  File "/home/cyrusleung/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/cuda/__init__.py", line 444, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/home/cyrusleung/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/cuda/__init__.py", line 306, in _lazy_init
    queued_call()
  File "/home/cyrusleung/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/cuda/__init__.py", line 173, in _check_capability
    for d in range(device_count()):
  File "/home/cyrusleung/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/cuda/__init__.py", line 745, in device_count
    import traceback; traceback.print_stack()

But I think this is supposed to happen, right?

Edit: The traceback is from a local version of the PR that has some additional changes compared to the CI build. I'll push it once its dependency has been merged, so I can see whether the failure persists.
Update: Using a lazy import in vllm.transformers_utils.image_processor seems to fix the problem.
