
Model is configured to use GPUs 2 and 3, but it keeps failing with an error saying GPU 0 has no free resources #2405

Closed
1 of 3 tasks
SDAIer opened this issue Oct 9, 2024 · 9 comments


SDAIer commented Oct 9, 2024

System Info

NVIDIA-SMI 535.183.06 Driver Version: 535.183.06 CUDA Version: 12.2
linux

Running Xinference with Docker?

  • docker
  • pip install
  • installation from source

Version info

0.15.2

The command used to start Xinference

docker run
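
Only `docker run` was captured above; the rest of the command is missing. For reference, a typical invocation that pins an Xinference container to GPUs 2 and 3 might look like the sketch below (the image tag, port, and options are assumptions, not the reporter's actual command). Note that when only devices 2 and 3 are exposed this way, CUDA renumbers them inside the container, and Xinference's own gpu_idx selection typically works the same way (via CUDA_VISIBLE_DEVICES), so a later error that mentions "GPU 0" can actually refer to physical GPU 2.

```shell
# Hypothetical sketch, not the reporter's actual command.
# --gpus '"device=2,3"' exposes only physical GPUs 2 and 3 to the container;
# CUDA then renumbers them inside the container as GPU 0 and GPU 1.
docker run -d \
  -p 9997:9997 \
  --gpus '"device=2,3"' \
  xprobe/xinference:v0.15.2 \
  xinference-local -H 0.0.0.0 --log-level debug
```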

Reproduction

[screenshots attached]

Full log below:

gains and losses arising from the various transactions and events on the basis of which a correct judgment of the company's operating performance and profitability can be made.
II. Weighted average return on net assets and earnings per share: ROE (%), basic EPS, and diluted EPS (Jan-Jun 2024 vs. Jan-Jun 2023): net profit attributable to ordinary shareholders of the company 7.16% vs. 7.08%, 0.68 vs. 0.65, 0.68 vs. 0.65; net profit attributable to ordinary shareholders after deducting non-recurring gains and losses 7.02% vs. 7.01%, 0.67 vs. 0.64, 0.67 vs. 0.64


<|im_end|>
<|im_start|>user
Summarize<|im_end|>
<|im_start|>assistant
, generate config: {'echo': False, 'max_tokens': 200, 'repetition_penalty': 1.1, 'stop': ['<|endoftext|>', '<|im_start|>', '<|im_end|>'], 'stop_token_ids': [151643, 151644, 151645], 'stream': True, 'stream_options': {'include_usage': False}, 'stream_interval': 2, 'temperature': 0.01, 'top_p': 0.95, 'top_k': 40, 'lora_name': None, 'request_id': None, 'model': 'qwen2.5-instruct'}
2024-10-08 20:51:53,189 xinference.core.model 5009 DEBUG [request d22a7aac-85f1-11ef-91a4-0242ac110004] Leave chat, elapsed time: 0 s
2024-10-08 20:51:53,190 xinference.core.model 5009 DEBUG After request chat, current serve request count: 0 for the model qwen2.5-instruct
2024-10-08 20:51:53,405 transformers.models.qwen2.modeling_qwen2 5009 WARNING We detected that you are passing past_key_values as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate Cache class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
2024-10-08 20:51:58,223 xinference.core.model 5009 ERROR Model actor is out of memory, model id: qwen2.5-instruct
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 332, in _to_generator
for v in gen:
File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/utils.py", line 255, in _to_chat_completion_chunks
for i, chunk in enumerate(chunks):
File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/transformers/core.py", line 356, in generator_wrapper
for completion_chunk, completion_usage in generate_stream(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 36, in generator_context
response = gen.send(None)
File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/transformers/utils.py", line 178, in generate_stream
out = model(torch.as_tensor([input_ids], device=device), use_cache=True)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2/modeling_qwen2.py", line 1119, in forward
logits = logits.float()
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 18.43 GiB. GPU 0 has a total capacity of 23.50 GiB of which 9.36 GiB is free. Process 36705 has 14.13 GiB memory in use. Of the allocated memory 13.28 GiB is allocated by PyTorch, and 587.42 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2024-10-08 20:51:58,614 xinference.api.restful_api 1 ERROR Chat completion stream got an error: Remote server 0.0.0.0:37128 closed
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/xinference/api/restful_api.py", line 1899, in stream_results
async for item in iterator:
File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 340, in anext
return await self._actor_ref.xoscar_next(self._uid)
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 230, in send
result = await self._wait(future, actor_ref.address, send_message) # type: ignore
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 115, in _wait
return await future
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/core.py", line 84, in _listen
raise ServerClosed(
xoscar.errors.ServerClosed: Remote server 0.0.0.0:37128 closed
2024-10-08 20:51:59,101 xinference.core.worker 140 WARNING Process 0.0.0.0:37128 is down.
2024-10-08 20:51:59,112 xinference.core.worker 140 INFO [request d5ba02a0-85f1-11ef-b572-0242ac110004] Enter terminate_model, args: <xinference.core.worker.WorkerActor object at 0x7f0105fdfdd0>,qwen2.5-instruct-1-0, kwargs: is_model_die=True
2024-10-08 20:51:59,113 xinference.core.worker 140 DEBUG Destroy model actor failed, model uid: qwen2.5-instruct-1-0, error: [Errno 111] Connection refused
2024-10-08 20:51:59,114 xinference.core.worker 140 DEBUG Remove sub pool failed, model uid: qwen2.5-instruct-1-0, error: '0.0.0.0:37128'
2024-10-08 20:51:59,114 xinference.core.worker 140 INFO [request d5ba02a0-85f1-11ef-b572-0242ac110004] Leave terminate_model, elapsed time: 0 s
2024-10-08 20:51:59,114 xinference.core.worker 140 WARNING Recreating model actor qwen2.5-instruct-1-0 ...
2024-10-08 20:51:59,115 xinference.core.worker 140 INFO [request d5ba6bbe-85f1-11ef-b572-0242ac110004] Enter launch_builtin_model, args: <xinference.core.worker.WorkerActor object at 0x7f0105fdfdd0>, kwargs: model_uid=qwen2.5-instruct-1-0,model_name=qwen2.5-instruct,model_size_in_billions=3,model_format=pytorch,quantization=none,model_engine=Transformers,model_type=LLM,n_gpu=2,peft_model_config=None,request_limits=None,gpu_idx=None,download_hub=None,model_path=None,max_model_len=30000
2024-10-08 20:51:59,117 xinference.core.worker 140 DEBUG GPU selected: [2, 3] for model qwen2.5-instruct-1-0
2024-10-08 20:52:05,452 xinference.model.llm.core 140 DEBUG Launching qwen2.5-instruct-1-0 with PytorchChatModel
2024-10-08 20:52:05,453 xinference.model.llm.llm_family 140 INFO Caching from Modelscope: qwen/Qwen2.5-3B-Instruct
2024-10-08 20:52:05,453 xinference.model.llm.llm_family 140 INFO Cache /root/.xinference/cache/qwen2_5-instruct-pytorch-3b exists
2024-10-08 20:52:05,596 transformers.tokenization_utils_base 5251 INFO loading file vocab.json
2024-10-08 20:52:05,597 transformers.tokenization_utils_base 5251 INFO loading file merges.txt
2024-10-08 20:52:05,597 transformers.tokenization_utils_base 5251 INFO loading file tokenizer.json
2024-10-08 20:52:05,597 transformers.tokenization_utils_base 5251 INFO loading file added_tokens.json
2024-10-08 20:52:05,597 transformers.tokenization_utils_base 5251 INFO loading file special_tokens_map.json
2024-10-08 20:52:05,597 transformers.tokenization_utils_base 5251 INFO loading file tokenizer_config.json
2024-10-08 20:52:05,800 transformers.tokenization_utils_base 5251 INFO Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-10-08 20:52:05,801 transformers.configuration_utils 5251 INFO loading configuration file /root/.xinference/cache/qwen2_5-instruct-pytorch-3b/config.json
2024-10-08 20:52:05,802 transformers.configuration_utils 5251 INFO Model config Qwen2Config {
"_name_or_path": "/root/.xinference/cache/qwen2_5-instruct-pytorch-3b",
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"hidden_act": "silu",
"hidden_size": 2048,
"initializer_range": 0.02,
"intermediate_size": 11008,
"max_position_embeddings": 32768,
"max_window_layers": 70,
"model_type": "qwen2",
"num_attention_heads": 16,
"num_hidden_layers": 36,
"num_key_value_heads": 2,
"rms_norm_eps": 1e-06,
"rope_theta": 1000000.0,
"sliding_window": null,
"tie_word_embeddings": true,
"torch_dtype": "float16",
"transformers_version": "4.44.2",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 151936
}

2024-10-08 20:52:05,942 transformers.modeling_utils 5251 INFO loading weights file /root/.xinference/cache/qwen2_5-instruct-pytorch-3b/model.safetensors.index.json
2024-10-08 20:52:05,942 transformers.modeling_utils 5251 INFO Instantiating Qwen2ForCausalLM model under default dtype torch.float16.
2024-10-08 20:52:05,943 transformers.generation.configuration_utils 5251 INFO Generate config GenerationConfig {
"bos_token_id": 151643,
"eos_token_id": 151645
}

Loading checkpoint shards: 100%|█████████████████| 2/2 [00:02<00:00, 1.37s/it]
2024-10-08 20:52:09,374 transformers.modeling_utils 5251 INFO All model checkpoint weights were used when initializing Qwen2ForCausalLM.

2024-10-08 20:52:09,374 transformers.modeling_utils 5251 INFO All the weights of Qwen2ForCausalLM were initialized from the model checkpoint at /root/.xinference/cache/qwen2_5-instruct-pytorch-3b.
If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2ForCausalLM for predictions without further training.
2024-10-08 20:52:09,378 transformers.generation.configuration_utils 5251 INFO loading configuration file /root/.xinference/cache/qwen2_5-instruct-pytorch-3b/generation_config.json
2024-10-08 20:52:09,378 transformers.generation.configuration_utils 5251 INFO Generate config GenerationConfig {
"bos_token_id": 151643,
"do_sample": true,
"eos_token_id": [
151645,
151643
],
"pad_token_id": 151643,
"repetition_penalty": 1.05,
"temperature": 0.7,
"top_k": 20,
"top_p": 0.8
}

2024-10-08 20:52:09,620 xinference.model.llm.transformers.core 5251 DEBUG Model Memory: 6775866368
2024-10-08 20:52:09,623 xinference.core.worker 140 INFO [request d5ba6bbe-85f1-11ef-b572-0242ac110004] Leave launch_builtin_model, elapsed time: 10 s

Expected behavior

The model should use the selected GPU resources (GPUs 2 and 3) normally.
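
The OOM message in the log recommends trying the expandable-segments allocator when a lot of memory is "reserved but unallocated". A minimal sketch of passing it to the container follows; everything except the variable itself is an assumption:

```shell
# PYTORCH_CUDA_ALLOC_CONF comes straight from the OOM message; it only helps
# when fragmentation (large "reserved but unallocated" memory) is the issue.
docker run -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  ...  # remaining options as in the original (uncaptured) command
```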

@XprobeBot XprobeBot added the gpu label Oct 9, 2024
@XprobeBot XprobeBot added this to the v0.15 milestone Oct 9, 2024

turndown commented Oct 9, 2024

I ran into a similar problem: an SD3 model runs on GPU 3 as configured, but when I run Flux with GPU 3 selected it still ends up on GPU 0, which is odd.
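
One way to check where a model actually landed is to watch per-process GPU memory while it loads; a sketch, assuming the container is named "xinference":

```shell
# Hypothetical container name; the per-process table at the bottom of the
# output shows which physical GPUs are holding the model's memory.
docker exec -it xinference nvidia-smi
```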


This issue is stale because it has been open for 7 days with no activity.

@github-actions github-actions bot added the stale label Oct 16, 2024

SDAIer commented Oct 17, 2024 via email

@github-actions github-actions bot removed the stale label Oct 17, 2024

This issue is stale because it has been open for 7 days with no activity.

@github-actions github-actions bot added the stale label Oct 24, 2024

This issue was closed because it has been inactive for 5 days since being marked as stale.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Oct 30, 2024
@frankSARU

[screenshot attached]
I have the same problem. My model deliberately avoids GPU 0, but I still get this error.


JUN-ZZ commented Nov 27, 2024

Why has nobody fixed this "GPU 0 has a total capacity of ..." error yet? It has been so long.


JUN-ZZ commented Nov 27, 2024

@frankSARU @SDAIer Have you solved this "GPU 0 has a total capacity of ..." problem?

@frankSARU

> @frankSARU @SDAIer Have you solved this "GPU 0 has a total capacity of ..." problem?

Not solved. My guess is that it is related to overly long embedding inputs; for now, lowering chunk_size and max_token works around similar problems to some extent.
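
That observation is consistent with the traceback above: the failing 18.43 GiB allocation happens at logits.float(), and the size of that tensor scales with sequence length times vocabulary size, so shorter inputs shrink it directly. Since Xinference serves an OpenAI-compatible REST API, the output budget of a single request can also be capped per call; a hypothetical sketch (host, port, and values are assumptions):

```shell
# Hypothetical request; capping max_tokens bounds how far the KV cache can
# grow during generation for this one call. Host/port/values are assumptions.
curl http://127.0.0.1:9997/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "qwen2.5-instruct",
        "messages": [{"role": "user", "content": "Summarize this document"}],
        "max_tokens": 512
      }'
```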
