
Setting "CUDA_VISIBLE_DEVICES" has no effect after deploying with vLLM #359

Open
Jasper-LittleBrotherHeart opened this issue Feb 12, 2025 · 0 comments

The code is as follows:
```python
import os
import json
from vllm import LLM, SamplingParams

def get_completion(prompts, model, tokenizer=None, max_tokens=512, temperature=0.8, top_p=0.95, max_model_len=2048):
    stop_token_ids = [151329, 151336, 151338]
    # Create the sampling parameters: temperature controls the diversity of the generated text, top_p controls the nucleus-sampling probability
    sampling_params = SamplingParams(temperature=temperature, top_p=top_p, max_tokens=max_tokens, stop_token_ids=stop_token_ids)
    # Initialize the vLLM inference engine
    os.environ["CUDA_VISIBLE_DEVICES"] = "4,5"
    print(os.environ["CUDA_VISIBLE_DEVICES"])
    llm = LLM(model=model, tokenizer=tokenizer, max_model_len=max_model_len, trust_remote_code=True)
    outputs = llm.generate(prompts, sampling_params)
    return outputs

def load_model_paths(file_path):
    with open(file_path, 'r') as f:
        model_paths = f.readlines()
    return [path.strip() for path in model_paths]

def load_prompts(file_path):
    with open(file_path, 'r') as f:
        prompts = json.load(f)
    return prompts
```

It still fails, and the traceback shows it is running on GPU 0:

```
Traceback (most recent call last):
File "test.py", line 76, in
responses = get_completion(prompts_to_process, model_path)
File "test.py", line 13, in get_completion
llm = LLM(model=model, tokenizer=tokenizer, max_model_len=max_model_len, trust_remote_code=True)
File "/n/work3/jzhao/miniconda3/envs/muser/lib/python3.8/site-packages/vllm/entrypoints/llm.py", line 112, in init
self.llm_engine = LLMEngine.from_engine_args(
File "/n/work3/jzhao/miniconda3/envs/muser/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 196, in from_engine_args
engine = cls(
File "/n/work3/jzhao/miniconda3/envs/muser/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 110, in init
self.model_executor = executor_class(model_config, cache_config,
File "/n/work3/jzhao/miniconda3/envs/muser/lib/python3.8/site-packages/vllm/executor/gpu_executor.py", line 37, in init
self._init_worker()
File "/n/work3/jzhao/miniconda3/envs/muser/lib/python3.8/site-packages/vllm/executor/gpu_executor.py", line 66, in _init_worker
self.driver_worker.load_model()
File "/n/work3/jzhao/miniconda3/envs/muser/lib/python3.8/site-packages/vllm/worker/worker.py", line 107, in load_model
self.model_runner.load_model()
File "/n/work3/jzhao/miniconda3/envs/muser/lib/python3.8/site-packages/vllm/worker/model_runner.py", line 95, in load_model
self.model = get_model(
File "/n/work3/jzhao/miniconda3/envs/muser/lib/python3.8/site-packages/vllm/model_executor/model_loader.py", line 81, in get_model
model = model_class(model_config.hf_config, linear_method,
File "/n/work3/jzhao/miniconda3/envs/muser/lib/python3.8/site-packages/vllm/model_executor/models/qwen2.py", line 298, in init
self.model = Qwen2Model(config, linear_method)
File "/n/work3/jzhao/miniconda3/envs/muser/lib/python3.8/site-packages/vllm/model_executor/models/qwen2.py", line 237, in init
self.layers = nn.ModuleList([
File "/n/work3/jzhao/miniconda3/envs/muser/lib/python3.8/site-packages/vllm/model_executor/models/qwen2.py", line 238, in
Qwen2DecoderLayer(config, layer_idx, linear_method)
File "/n/work3/jzhao/miniconda3/envs/muser/lib/python3.8/site-packages/vllm/model_executor/models/qwen2.py", line 181, in init
self.mlp = Qwen2MLP(
File "/n/work3/jzhao/miniconda3/envs/muser/lib/python3.8/site-packages/vllm/model_executor/models/qwen2.py", line 62, in init
self.gate_up_proj = MergedColumnParallelLinear(
File "/n/work3/jzhao/miniconda3/envs/muser/lib/python3.8/site-packages/vllm/model_executor/layers/linear.py", line 260, in init
super().init(input_size, sum(output_sizes), bias, gather_output,
File "/n/work3/jzhao/miniconda3/envs/muser/lib/python3.8/site-packages/vllm/model_executor/layers/linear.py", line 181, in init
self.linear_weights = self.linear_method.create_weights(
File "/n/work3/jzhao/miniconda3/envs/muser/lib/python3.8/site-packages/vllm/model_executor/layers/linear.py", line 63, in create_weights
weight = Parameter(torch.empty(output_size_per_partition,
File "/n/work3/jzhao/miniconda3/envs/muser/lib/python3.8/site-packages/torch/utils/_device.py", line 77, in torch_function
return func(*args, **kwargs)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 540.00 MiB. GPU 0 has a total capacty of 47.54 GiB of which 59.75 MiB is free. Including non-PyTorch memory, this process has 47.47 GiB memory in use. Of the allocated memory 47.01 GiB is allocated by PyTorch, and 14.33 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF`
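
For reference, a minimal sketch of the usual workaround, assuming the cause is the common one: `CUDA_VISIBLE_DEVICES` is only read when the CUDA context is first created, so assigning it inside `get_completion`, after the `torch`/`vllm` imports may already have initialized CUDA, comes too late. The `tensor_parallel_size=2` below is an assumption for sharding the model across the two visible GPUs, not something from the original report.

```python
# Hedged sketch: set CUDA_VISIBLE_DEVICES before any import that can
# initialize CUDA; the driver reads it only when the context is created.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "4,5"

from vllm import LLM, SamplingParams  # imported only after the variable is set

def get_completion(prompts, model, tokenizer=None, max_tokens=512,
                   temperature=0.8, top_p=0.95, max_model_len=2048):
    sampling_params = SamplingParams(temperature=temperature, top_p=top_p,
                                     max_tokens=max_tokens,
                                     stop_token_ids=[151329, 151336, 151338])
    # tensor_parallel_size=2 (an assumption) shards the weights over both
    # visible GPUs instead of loading the full model onto one device.
    llm = LLM(model=model, tokenizer=tokenizer, max_model_len=max_model_len,
              tensor_parallel_size=2, trust_remote_code=True)
    return llm.generate(prompts, sampling_params)
```

Alternatively, exporting the variable in the shell guarantees it is in place before the interpreter starts: `CUDA_VISIBLE_DEVICES=4,5 python test.py`.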
