The model was launched on GPUs 2 and 3, but it keeps failing at runtime with errors saying GPU 0 is out of resources #2405
Comments
I ran into much the same problem: an SD3 model can be pinned to GPU 3, but when I run Flux with GPU 3 selected it still ends up on GPU 0, which is odd.
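As a point of reference, here is a minimal sketch of how explicit GPU placement is usually requested at launch time, assuming the RESTful client of this Xinference version accepts the same `gpu_idx` argument that shows up in the worker log further down; the endpoint, model name and indices are placeholders, not the reporter's exact setup:

```python
# Hypothetical launch call pinning a model to GPUs 2 and 3.
from xinference.client import Client

client = Client("http://localhost:9997")  # placeholder endpoint
model_uid = client.launch_model(
    model_name="qwen2.5-instruct",
    model_engine="transformers",
    model_size_in_billions=3,
    model_format="pytorch",
    n_gpu=2,         # number of GPUs the model may occupy
    gpu_idx=[2, 3],  # explicit device indices (assumed to be honored by the worker)
)
print(model_uid)
```

If placement still leaks onto GPU 0, restricting the devices visible to the container, so that only GPUs 2 and 3 exist from the process's point of view, is the more robust workaround.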
This issue is stale because it has been open for 7 days with no activity.
Any updates?
This issue is stale because it has been open for 7 days with no activity.
This issue was closed because it has been inactive for 5 days since being marked as stale.
Why has nobody solved the "GPU 0 has a total capacity of ..." error yet? It has been so long.
@frankSARU @SDAIer Did you ever solve the "GPU 0 has a total capacity of ..." problem?
Not solved, but my guess is that it is related to the embedding input being too long. For now, lowering chunk_size and max_token works around problems like this to some extent.
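To illustrate the workaround described in the comment above, here is a minimal sketch that caps both the amount of context stuffed into the prompt and the completion length. It talks to Xinference's OpenAI-compatible endpoint; the base URL, model uid and the two limits are placeholders rather than values from this thread:

```python
# Sketch of the "lower chunk_size and max_token" workaround: keep a single
# request from asking the model to materialize logits for a very long sequence.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9997/v1", api_key="not-needed")

MAX_CHUNK_CHARS = 2000        # hypothetical chunk_size: trim the retrieved context
MAX_COMPLETION_TOKENS = 512   # lowered max_tokens

def summarize(document: str) -> str:
    context = document[:MAX_CHUNK_CHARS]  # crude truncation, for illustration only
    resp = client.chat.completions.create(
        model="qwen2.5-instruct",
        messages=[{"role": "user", "content": f"总结:\n{context}"}],
        max_tokens=MAX_COMPLETION_TOKENS,
        temperature=0.01,
    )
    return resp.choices[0].message.content
```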
System Info
NVIDIA-SMI 535.183.06 Driver Version: 535.183.06 CUDA Version: 12.2
linux
Running Xinference with Docker?
Version info
0.15.2
The command used to start Xinference
docker run
Reproduction
The full log is as follows:
经营业绩和盈利能力作出正确判断的各项交易和事项产生的损益。
二 净资产收益率及每股收益加权平均 每股收益净资产收益率(%) 基本每股收益 稀释每股收益2024 年1 至 6 月2023 年1 至 6 月2024 年1 至 6 月2023 年1 至 6 月2024 年1 至 6 月2023 年1 至 6 月归属于公司普通股股东的净利润 7.16% 7.08% 0.68 0.65 0.68 0.65扣除非经常性损益后归属于公司普通股股东的净利润 7.02% 7.01% 0.67 0.64 0.67 0.64
<|im_end|>
<|im_start|>user
总结<|im_end|>
<|im_start|>assistant
, generate config: {'echo': False, 'max_tokens': 200, 'repetition_penalty': 1.1, 'stop': ['<|endoftext|>', '<|im_start|>', '<|im_end|>'], 'stop_token_ids': [151643, 151644, 151645], 'stream': True, 'stream_options': {'include_usage': False}, 'stream_interval': 2, 'temperature': 0.01, 'top_p': 0.95, 'top_k': 40, 'lora_name': None, 'request_id': None, 'model': 'qwen2.5-instruct'}
2024-10-08 20:51:53,189 xinference.core.model 5009 DEBUG [request d22a7aac-85f1-11ef-91a4-0242ac110004] Leave chat, elapsed time: 0 s
2024-10-08 20:51:53,190 xinference.core.model 5009 DEBUG After request chat, current serve request count: 0 for the model qwen2.5-instruct
2024-10-08 20:51:53,405 transformers.models.qwen2.modeling_qwen2 5009 WARNING We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
2024-10-08 20:51:58,223 xinference.core.model 5009 ERROR Model actor is out of memory, model id: qwen2.5-instruct
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 332, in _to_generator
for v in gen:
File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/utils.py", line 255, in _to_chat_completion_chunks
for i, chunk in enumerate(chunks):
File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/transformers/core.py", line 356, in generator_wrapper
for completion_chunk, completion_usage in generate_stream(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 36, in generator_context
response = gen.send(None)
File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/transformers/utils.py", line 178, in generate_stream
out = model(torch.as_tensor([input_ids], device=device), use_cache=True)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2/modeling_qwen2.py", line 1119, in forward
logits = logits.float()
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 18.43 GiB. GPU 0 has a total capacity of 23.50 GiB of which 9.36 GiB is free. Process 36705 has 14.13 GiB memory in use. Of the allocated memory 13.28 GiB is allocated by PyTorch, and 587.42 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2024-10-08 20:51:58,614 xinference.api.restful_api 1 ERROR Chat completion stream got an error: Remote server 0.0.0.0:37128 closed
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/xinference/api/restful_api.py", line 1899, in stream_results
async for item in iterator:
File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 340, in anext
return await self._actor_ref.__xoscar_next__(self._uid)
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 230, in send
result = await self._wait(future, actor_ref.address, send_message) # type: ignore
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 115, in _wait
return await future
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/core.py", line 84, in _listen
raise ServerClosed(
xoscar.errors.ServerClosed: Remote server 0.0.0.0:37128 closed
2024-10-08 20:51:59,101 xinference.core.worker 140 WARNING Process 0.0.0.0:37128 is down.
2024-10-08 20:51:59,112 xinference.core.worker 140 INFO [request d5ba02a0-85f1-11ef-b572-0242ac110004] Enter terminate_model, args: <xinference.core.worker.WorkerActor object at 0x7f0105fdfdd0>,qwen2.5-instruct-1-0, kwargs: is_model_die=True
2024-10-08 20:51:59,113 xinference.core.worker 140 DEBUG Destroy model actor failed, model uid: qwen2.5-instruct-1-0, error: [Errno 111] Connection refused
2024-10-08 20:51:59,114 xinference.core.worker 140 DEBUG Remove sub pool failed, model uid: qwen2.5-instruct-1-0, error: '0.0.0.0:37128'
2024-10-08 20:51:59,114 xinference.core.worker 140 INFO [request d5ba02a0-85f1-11ef-b572-0242ac110004] Leave terminate_model, elapsed time: 0 s
2024-10-08 20:51:59,114 xinference.core.worker 140 WARNING Recreating model actor qwen2.5-instruct-1-0 ...
2024-10-08 20:51:59,115 xinference.core.worker 140 INFO [request d5ba6bbe-85f1-11ef-b572-0242ac110004] Enter launch_builtin_model, args: <xinference.core.worker.WorkerActor object at 0x7f0105fdfdd0>, kwargs: model_uid=qwen2.5-instruct-1-0,model_name=qwen2.5-instruct,model_size_in_billions=3,model_format=pytorch,quantization=none,model_engine=Transformers,model_type=LLM,n_gpu=2,peft_model_config=None,request_limits=None,gpu_idx=None,download_hub=None,model_path=None,max_model_len=30000
2024-10-08 20:51:59,117 xinference.core.worker 140 DEBUG GPU selected: [2, 3] for model qwen2.5-instruct-1-0
2024-10-08 20:52:05,452 xinference.model.llm.core 140 DEBUG Launching qwen2.5-instruct-1-0 with PytorchChatModel
2024-10-08 20:52:05,453 xinference.model.llm.llm_family 140 INFO Caching from Modelscope: qwen/Qwen2.5-3B-Instruct
2024-10-08 20:52:05,453 xinference.model.llm.llm_family 140 INFO Cache /root/.xinference/cache/qwen2_5-instruct-pytorch-3b exists
2024-10-08 20:52:05,596 transformers.tokenization_utils_base 5251 INFO loading file vocab.json
2024-10-08 20:52:05,597 transformers.tokenization_utils_base 5251 INFO loading file merges.txt
2024-10-08 20:52:05,597 transformers.tokenization_utils_base 5251 INFO loading file tokenizer.json
2024-10-08 20:52:05,597 transformers.tokenization_utils_base 5251 INFO loading file added_tokens.json
2024-10-08 20:52:05,597 transformers.tokenization_utils_base 5251 INFO loading file special_tokens_map.json
2024-10-08 20:52:05,597 transformers.tokenization_utils_base 5251 INFO loading file tokenizer_config.json
2024-10-08 20:52:05,800 transformers.tokenization_utils_base 5251 INFO Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-10-08 20:52:05,801 transformers.configuration_utils 5251 INFO loading configuration file /root/.xinference/cache/qwen2_5-instruct-pytorch-3b/config.json
2024-10-08 20:52:05,802 transformers.configuration_utils 5251 INFO Model config Qwen2Config {
"_name_or_path": "/root/.xinference/cache/qwen2_5-instruct-pytorch-3b",
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"hidden_act": "silu",
"hidden_size": 2048,
"initializer_range": 0.02,
"intermediate_size": 11008,
"max_position_embeddings": 32768,
"max_window_layers": 70,
"model_type": "qwen2",
"num_attention_heads": 16,
"num_hidden_layers": 36,
"num_key_value_heads": 2,
"rms_norm_eps": 1e-06,
"rope_theta": 1000000.0,
"sliding_window": null,
"tie_word_embeddings": true,
"torch_dtype": "float16",
"transformers_version": "4.44.2",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 151936
}
2024-10-08 20:52:05,942 transformers.modeling_utils 5251 INFO loading weights file /root/.xinference/cache/qwen2_5-instruct-pytorch-3b/model.safetensors.index.json
2024-10-08 20:52:05,942 transformers.modeling_utils 5251 INFO Instantiating Qwen2ForCausalLM model under default dtype torch.float16.
2024-10-08 20:52:05,943 transformers.generation.configuration_utils 5251 INFO Generate config GenerationConfig {
"bos_token_id": 151643,
"eos_token_id": 151645
}
Loading checkpoint shards: 100%|█████████████████| 2/2 [00:02<00:00, 1.37s/it]
2024-10-08 20:52:09,374 transformers.modeling_utils 5251 INFO All model checkpoint weights were used when initializing Qwen2ForCausalLM.
2024-10-08 20:52:09,374 transformers.modeling_utils 5251 INFO All the weights of Qwen2ForCausalLM were initialized from the model checkpoint at /root/.xinference/cache/qwen2_5-instruct-pytorch-3b.
If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2ForCausalLM for predictions without further training.
2024-10-08 20:52:09,378 transformers.generation.configuration_utils 5251 INFO loading configuration file /root/.xinference/cache/qwen2_5-instruct-pytorch-3b/generation_config.json
2024-10-08 20:52:09,378 transformers.generation.configuration_utils 5251 INFO Generate config GenerationConfig {
"bos_token_id": 151643,
"do_sample": true,
"eos_token_id": [
151645,
151643
],
"pad_token_id": 151643,
"repetition_penalty": 1.05,
"temperature": 0.7,
"top_k": 20,
"top_p": 0.8
}
2024-10-08 20:52:09,620 xinference.model.llm.transformers.core 5251 DEBUG Model Memory: 6775866368
2024-10-08 20:52:09,623 xinference.core.worker 140 INFO [request d5ba6bbe-85f1-11ef-b572-0242ac110004] Leave launch_builtin_model, elapsed time: 10 s
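As a side note on the OOM traceback above: the allocator hint it suggests, together with a device mask that hides GPU 0 entirely, can be checked in a throwaway Python session before being passed to the container via `docker run -e ...`. This is only an illustrative sketch, not part of the reporter's setup:

```python
# Illustrative only: these environment variables must be set before the CUDA
# context is created, i.e. before torch touches the GPU.
import os

# Allocator hint suggested by the OOM message above.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
# Hide GPU 0/1 entirely; GPUs 2 and 3 then show up as cuda:0 and cuda:1.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "2,3")

import torch  # import after the env vars so the allocator picks them up

print(torch.cuda.device_count())  # expected to report 2 with the mask above
```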
Expected behavior
GPU resources should be used as specified.