I ran into the following error:

```
INFO 09-22 21:48:03 api_server.py:495] vLLM API server version 0.6.1
INFO 09-22 21:48:03 api_server.py:496] args: Namespace(host='0.0.0.0', port=40116, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='/data/llms/qwen/Qwen2-VL-72B-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', config_format='auto', dtype='float16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=20000, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=True, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=True, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['Qwen2-VL-72B-Instruct'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 09-22 21:48:03 api_server.py:162] Multiprocessing frontend to use ipc:///tmp/fe6dac8c-587f-49d9-9547-ae65fa9976a0 for RPC Path.
INFO 09-22 21:48:03 api_server.py:178] Started engine process with PID 115210
WARNING 09-22 21:48:06 config.py:1650] Casting torch.bfloat16 to torch.float16.
INFO 09-22 21:48:06 config.py:897] Defaulting to use mp for distributed inference
WARNING 09-22 21:48:06 config.py:383] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 09-22 21:48:06 llm_engine.py:232] Initializing an LLM engine (v0.6.1) with config: model='/data/llms/qwen/Qwen2-VL-72B-Instruct', speculative_config=None, tokenizer='/data/llms/qwen/Qwen2-VL-72B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=20000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen2-VL-72B-Instruct, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=False, use_async_output_proc=False)
WARNING 09-22 21:48:06 multiproc_gpu_executor.py:56] Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 09-22 21:48:06 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=115364) INFO 09-22 21:48:06 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=115365) INFO 09-22 21:48:06 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=115364) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method init_device: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method, Traceback (most recent call last):
(VllmWorkerProcess pid=115364) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=115364) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=115364) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/vllm/worker/worker.py", line 166, in init_device
(VllmWorkerProcess pid=115364) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] torch.cuda.set_device(self.device)
(VllmWorkerProcess pid=115364) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/torch/cuda/__init__.py", line 420, in set_device
(VllmWorkerProcess pid=115364) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] torch._C._cuda_setDevice(device)
(VllmWorkerProcess pid=115364) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/torch/cuda/__init__.py", line 300, in _lazy_init
(VllmWorkerProcess pid=115364) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] raise RuntimeError(
(VllmWorkerProcess pid=115364) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
(VllmWorkerProcess pid=115364) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226]
(VllmWorkerProcess pid=115365) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method init_device: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method, Traceback (most recent call last):
(VllmWorkerProcess pid=115365) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=115365) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=115365) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/vllm/worker/worker.py", line 166, in init_device
(VllmWorkerProcess pid=115365) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] torch.cuda.set_device(self.device)
(VllmWorkerProcess pid=115365) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/torch/cuda/__init__.py", line 420, in set_device
(VllmWorkerProcess pid=115365) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] torch._C._cuda_setDevice(device)
(VllmWorkerProcess pid=115365) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/torch/cuda/__init__.py", line 300, in _lazy_init
(VllmWorkerProcess pid=115365) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] raise RuntimeError(
(VllmWorkerProcess pid=115365) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
(VllmWorkerProcess pid=115365) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226]
(VllmWorkerProcess pid=115366) INFO 09-22 21:48:06 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=115366) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method init_device: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method, Traceback (most recent call last):
(VllmWorkerProcess pid=115366) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=115366) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=115366) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/vllm/worker/worker.py", line 166, in init_device
(VllmWorkerProcess pid=115366) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] torch.cuda.set_device(self.device)
(VllmWorkerProcess pid=115366) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/torch/cuda/__init__.py", line 420, in set_device
(VllmWorkerProcess pid=115366) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] torch._C._cuda_setDevice(device)
(VllmWorkerProcess pid=115366) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/torch/cuda/__init__.py", line 300, in _lazy_init
(VllmWorkerProcess pid=115366) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] raise RuntimeError(
(VllmWorkerProcess pid=115366) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
(VllmWorkerProcess pid=115366) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226]
(VllmWorkerProcess pid=115367) INFO 09-22 21:48:06 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=115367) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method init_device: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method, Traceback (most recent call last):
(VllmWorkerProcess pid=115367) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=115367) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=115367) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/vllm/worker/worker.py", line 166, in init_device
(VllmWorkerProcess pid=115367) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] torch.cuda.set_device(self.device)
(VllmWorkerProcess pid=115367) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/torch/cuda/__init__.py", line 420, in set_device
(VllmWorkerProcess pid=115367) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] torch._C._cuda_setDevice(device)
(VllmWorkerProcess pid=115367) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/torch/cuda/__init__.py", line 300, in _lazy_init
(VllmWorkerProcess pid=115367) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] raise RuntimeError(
(VllmWorkerProcess pid=115367) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
(VllmWorkerProcess pid=115367) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226]
(VllmWorkerProcess pid=115368) INFO 09-22 21:48:06 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=115368) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method init_device: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method, Traceback (most recent call last):
(VllmWorkerProcess pid=115368) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=115368) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=115368) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/vllm/worker/worker.py", line 166, in init_device
(VllmWorkerProcess pid=115368) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] torch.cuda.set_device(self.device)
(VllmWorkerProcess pid=115368) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/torch/cuda/__init__.py", line 420, in set_device
(VllmWorkerProcess pid=115368) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] torch._C._cuda_setDevice(device)
(VllmWorkerProcess pid=115368) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/torch/cuda/__init__.py", line 300, in _lazy_init
(VllmWorkerProcess pid=115368) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] raise RuntimeError(
(VllmWorkerProcess pid=115368) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
(VllmWorkerProcess pid=115368) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226]
(VllmWorkerProcess pid=115369) INFO 09-22 21:48:06 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=115370) INFO 09-22 21:48:06 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=115370) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method init_device: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method, Traceback (most recent call last):
(VllmWorkerProcess pid=115370) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=115370) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=115370) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/vllm/worker/worker.py", line 166, in init_device
(VllmWorkerProcess pid=115370) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] torch.cuda.set_device(self.device)
(VllmWorkerProcess pid=115370) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/torch/cuda/__init__.py", line 420, in set_device
(VllmWorkerProcess pid=115370) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] torch._C._cuda_setDevice(device)
(VllmWorkerProcess pid=115370) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/torch/cuda/__init__.py", line 300, in _lazy_init
(VllmWorkerProcess pid=115370) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] raise RuntimeError(
(VllmWorkerProcess pid=115370) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
(VllmWorkerProcess pid=115370) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226]
(VllmWorkerProcess pid=115369) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method init_device: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method, Traceback (most recent call last):
(VllmWorkerProcess pid=115369) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=115369) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=115369) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/vllm/worker/worker.py", line 166, in init_device
(VllmWorkerProcess pid=115369) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] torch.cuda.set_device(self.device)
(VllmWorkerProcess pid=115369) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/torch/cuda/__init__.py", line 420, in set_device
(VllmWorkerProcess pid=115369) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] torch._C._cuda_setDevice(device)
(VllmWorkerProcess pid=115369) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/torch/cuda/__init__.py", line 300, in _lazy_init
(VllmWorkerProcess pid=115369) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] raise RuntimeError(
(VllmWorkerProcess pid=115369) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
(VllmWorkerProcess pid=115369) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226]
```
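Every worker fails inside `torch.cuda.set_device` because it was created with `fork`, and PyTorch refuses to initialize CUDA again in a forked child once CUDA state already exists in the parent. The following is a minimal, stdlib-only sketch (no CUDA involved) of how an environment variable can select the multiprocessing start method. It assumes, without quoting vLLM's source, that vLLM 0.6.x reads `VLLM_WORKER_MULTIPROC_METHOD` and falls back to `fork` when the variable is unset; `worker_context` is a hypothetical helper, not a vLLM function.

```python
import multiprocessing as mp
from typing import Mapping

# Assumption: vLLM 0.6.x reads this variable and defaults to "fork".
ENV_VAR = "VLLM_WORKER_MULTIPROC_METHOD"
DEFAULT_METHOD = "fork"


def worker_context(env: Mapping[str, str]) -> mp.context.BaseContext:
    """Hypothetical helper: return a multiprocessing context for the method in env."""
    method = env.get(ENV_VAR, DEFAULT_METHOD)
    if method not in mp.get_all_start_methods():
        raise ValueError(f"unsupported start method: {method!r}")
    # "fork" copies the parent's memory, including any CUDA state it already
    # created, which a child process cannot reuse. "spawn" starts a fresh
    # interpreter, so each worker initializes CUDA from scratch.
    return mp.get_context(method)


if __name__ == "__main__":
    ctx = worker_context({ENV_VAR: "spawn"})
    print(ctx.get_start_method())  # prints "spawn"
```

Using `mp.get_context` instead of `mp.set_start_method` keeps the choice local to the returned context rather than changing the process-wide default.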
Machine configuration:

1. Basic environment configuration

```bash
source /opt/rh/devtoolset-10/enable
source activate /data/anaconda3/envs/qwen-vl
log_file="${model_server_name}_${port}.log"
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m vllm.entrypoints.openai.api_server \
    --model ${model_path} \
    --trust-remote-code \
    --served-model-name ${model_server_name} \
    --enforce-eager \
    --dtype float16 \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 8 \
    --host ${local_ip} \
    --max-model-len ${max_model_len} \
    --disable-log-stats \
    --port ${port} > "${log_file}" 2>&1 &
```
Can anyone help me solve this problem? Thank you very much!
As the error message says, you may need to set `VLLM_WORKER_MULTIPROC_METHOD=spawn`:

```bash
VLLM_WORKER_MULTIPROC_METHOD=spawn python -m ...
```

After that, you will probably run into a value error; see #231.
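Applied to the launch script from the question, this fix amounts to adding one variable to the server's environment. A minimal sketch, assuming you prefer to build the command from Python; `build_launch` and `launch` are hypothetical helpers, and the flag values mirror the reporter's script and logged arguments. It does not address the follow-up value error from #231.

```python
import os
import shlex
import subprocess
import sys
from typing import Dict, List, Tuple


def build_launch(model_path: str, served_name: str, host: str, port: int,
                 max_model_len: int, tp_size: int = 8) -> Tuple[List[str], Dict[str, str]]:
    """Hypothetical helper: return (argv, env) for the OpenAI-compatible server."""
    env = dict(os.environ)
    # Same GPUs as CUDA_VISIBLE_DEVICES=0,...,7 in the original script.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in range(tp_size))
    # The suggested fix: start vLLM's worker processes with "spawn", not "fork".
    env["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
    argv = [
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",  # active env's python
        "--model", model_path,
        "--trust-remote-code",
        "--served-model-name", served_name,
        "--enforce-eager",
        "--dtype", "float16",
        "--gpu-memory-utilization", "0.9",
        "--tensor-parallel-size", str(tp_size),
        "--host", host,
        "--max-model-len", str(max_model_len),
        "--disable-log-stats",
        "--port", str(port),
    ]
    return argv, env


def launch(argv: List[str], env: Dict[str, str], log_file: str) -> subprocess.Popen:
    """Hypothetical helper: run in the background, like `... > "$log_file" 2>&1 &`."""
    with open(log_file, "w") as log:
        # The child gets its own copy of the file descriptor, so closing the
        # parent's handle when the `with` block exits is safe.
        return subprocess.Popen(argv, env=env, stdout=log, stderr=subprocess.STDOUT)


if __name__ == "__main__":
    argv, env = build_launch("/data/llms/qwen/Qwen2-VL-72B-Instruct",
                             "Qwen2-VL-72B-Instruct", "0.0.0.0", 40116, 20000)
    # Print the equivalent shell command instead of starting the server.
    print("VLLM_WORKER_MULTIPROC_METHOD=spawn " + shlex.join(argv))
```

Running the script only prints the command; calling `launch(argv, env, "Qwen2-VL-72B-Instruct_40116.log")` would actually start the server in the background.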
Okay, thanks very much!
No branches or pull requests
I met following questions:
INFO 09-22 21:48:03 api_server.py:495] vLLM API server version 0.6.1
INFO 09-22 21:48:03 api_server.py:496] args: Namespace(host='0.0.0.0', port=40116, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='/data/llms/qwen/Qwen2-VL-72B-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', config_format='auto', dtype='float16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=20000, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=True, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=True, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, 
scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['Qwen2-VL-72B-Instruct'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 09-22 21:48:03 api_server.py:162] Multiprocessing frontend to use ipc:///tmp/fe6dac8c-587f-49d9-9547-ae65fa9976a0 for RPC Path.
INFO 09-22 21:48:03 api_server.py:178] Started engine process with PID 115210
WARNING 09-22 21:48:06 config.py:1650] Casting torch.bfloat16 to torch.float16.
INFO 09-22 21:48:06 config.py:897] Defaulting to use mp for distributed inference
WARNING 09-22 21:48:06 config.py:383] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 09-22 21:48:06 llm_engine.py:232] Initializing an LLM engine (v0.6.1) with config: model='/data/llms/qwen/Qwen2-VL-72B-Instruct', speculative_config=None, tokenizer='/data/llms/qwen/Qwen2-VL-72B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=20000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen2-VL-72B-Instruct, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=False, use_async_output_proc=False)
WARNING 09-22 21:48:06 multiproc_gpu_executor.py:56] Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 09-22 21:48:06 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=115364) INFO 09-22 21:48:06 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=115365) INFO 09-22 21:48:06 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=115364) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method init_device: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method, Traceback (most recent call last):
(VllmWorkerProcess pid=115364) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=115364) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=115364) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/vllm/worker/worker.py", line 166, in init_device
(VllmWorkerProcess pid=115364) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] torch.cuda.set_device(self.device)
(VllmWorkerProcess pid=115364) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/torch/cuda/init.py", line 420, in set_device
(VllmWorkerProcess pid=115364) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] torch._C._cuda_setDevice(device)
(VllmWorkerProcess pid=115364) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/torch/cuda/init.py", line 300, in _lazy_init
(VllmWorkerProcess pid=115364) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] raise RuntimeError(
(VllmWorkerProcess pid=115364) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
(VllmWorkerProcess pid=115364) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226]
(VllmWorkerProcess pid=115365) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method init_device: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method, Traceback (most recent call last):
(VllmWorkerProcess pid=115365) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=115365) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=115365) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/vllm/worker/worker.py", line 166, in init_device
(VllmWorkerProcess pid=115365) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] torch.cuda.set_device(self.device)
(VllmWorkerProcess pid=115365) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/torch/cuda/init.py", line 420, in set_device
(VllmWorkerProcess pid=115365) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] torch._C._cuda_setDevice(device)
(VllmWorkerProcess pid=115365) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/torch/cuda/init.py", line 300, in _lazy_init
(VllmWorkerProcess pid=115365) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] raise RuntimeError(
(VllmWorkerProcess pid=115365) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
(VllmWorkerProcess pid=115365) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226]
(VllmWorkerProcess pid=115366) INFO 09-22 21:48:06 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=115366) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method init_device: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method, Traceback (most recent call last):
(VllmWorkerProcess pid=115366) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=115366) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=115366) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/vllm/worker/worker.py", line 166, in init_device
(VllmWorkerProcess pid=115366) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] torch.cuda.set_device(self.device)
(VllmWorkerProcess pid=115366) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/torch/cuda/init.py", line 420, in set_device
(VllmWorkerProcess pid=115366) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] torch._C._cuda_setDevice(device)
(VllmWorkerProcess pid=115366) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/torch/cuda/init.py", line 300, in _lazy_init
(VllmWorkerProcess pid=115366) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] raise RuntimeError(
(VllmWorkerProcess pid=115366) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
(VllmWorkerProcess pid=115366) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226]
(VllmWorkerProcess pid=115367) INFO 09-22 21:48:06 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=115367) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method init_device: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method, Traceback (most recent call last):
(VllmWorkerProcess pid=115367) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=115367) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=115367) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/vllm/worker/worker.py", line 166, in init_device
(VllmWorkerProcess pid=115367) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] torch.cuda.set_device(self.device)
(VllmWorkerProcess pid=115367) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/torch/cuda/__init__.py", line 420, in set_device
(VllmWorkerProcess pid=115367) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] torch._C._cuda_setDevice(device)
(VllmWorkerProcess pid=115367) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/torch/cuda/__init__.py", line 300, in _lazy_init
(VllmWorkerProcess pid=115367) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] raise RuntimeError(
(VllmWorkerProcess pid=115367) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
(VllmWorkerProcess pid=115367) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226]
(VllmWorkerProcess pid=115368) INFO 09-22 21:48:06 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=115368) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method init_device: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method, Traceback (most recent call last):
(VllmWorkerProcess pid=115368) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=115368) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=115368) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/vllm/worker/worker.py", line 166, in init_device
(VllmWorkerProcess pid=115368) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] torch.cuda.set_device(self.device)
(VllmWorkerProcess pid=115368) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/torch/cuda/__init__.py", line 420, in set_device
(VllmWorkerProcess pid=115368) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] torch._C._cuda_setDevice(device)
(VllmWorkerProcess pid=115368) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/torch/cuda/__init__.py", line 300, in _lazy_init
(VllmWorkerProcess pid=115368) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] raise RuntimeError(
(VllmWorkerProcess pid=115368) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
(VllmWorkerProcess pid=115368) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226]
(VllmWorkerProcess pid=115369) INFO 09-22 21:48:06 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=115370) INFO 09-22 21:48:06 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=115370) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method init_device: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method, Traceback (most recent call last):
(VllmWorkerProcess pid=115370) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=115370) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=115370) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/vllm/worker/worker.py", line 166, in init_device
(VllmWorkerProcess pid=115370) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] torch.cuda.set_device(self.device)
(VllmWorkerProcess pid=115370) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/torch/cuda/__init__.py", line 420, in set_device
(VllmWorkerProcess pid=115370) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] torch._C._cuda_setDevice(device)
(VllmWorkerProcess pid=115370) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/torch/cuda/__init__.py", line 300, in _lazy_init
(VllmWorkerProcess pid=115370) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] raise RuntimeError(
(VllmWorkerProcess pid=115370) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
(VllmWorkerProcess pid=115370) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226]
(VllmWorkerProcess pid=115369) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method init_device: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method, Traceback (most recent call last):
(VllmWorkerProcess pid=115369) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=115369) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=115369) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/vllm/worker/worker.py", line 166, in init_device
(VllmWorkerProcess pid=115369) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] torch.cuda.set_device(self.device)
(VllmWorkerProcess pid=115369) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/torch/cuda/__init__.py", line 420, in set_device
(VllmWorkerProcess pid=115369) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] torch._C._cuda_setDevice(device)
(VllmWorkerProcess pid=115369) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] File "/data/anaconda3/envs/qwen-vl/lib/python3.10/site-packages/torch/cuda/__init__.py", line 300, in _lazy_init
(VllmWorkerProcess pid=115369) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] raise RuntimeError(
(VllmWorkerProcess pid=115369) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226] RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
(VllmWorkerProcess pid=115369) ERROR 09-22 21:48:06 multiproc_worker_utils.py:226]
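Every worker fails with the same message: a child process created with `fork` cannot initialize CUDA if the parent process has already done so. The sketch below is a generic illustration using only Python's standard `multiprocessing` module, not vLLM's actual code. The names `choose_start_method`, `worker_context` and the `cuda_initialized` flag are hypothetical; in real code the flag would come from something like `torch.cuda.is_initialized()`.

```python
import multiprocessing as mp


def choose_start_method(cuda_initialized: bool) -> str:
    """Pick a start method for worker processes (hypothetical helper).

    A forked child inherits the parent's memory, including any CUDA state,
    and PyTorch refuses to re-initialize CUDA there. A spawned child starts
    a fresh interpreter, so it is safe even after the parent touched CUDA.
    """
    return "spawn" if cuda_initialized else "fork"


def worker_context(cuda_initialized: bool) -> mp.context.BaseContext:
    # get_context() returns a context whose Process/Pool/Queue classes
    # all use the requested start method.
    return mp.get_context(choose_start_method(cuda_initialized))


if __name__ == "__main__":
    ctx = worker_context(cuda_initialized=True)
    print(ctx.get_start_method())  # spawn
```

Using a context object rather than calling `mp.set_start_method()` keeps the choice local to the code that creates the workers instead of changing it for the whole program.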
Machine configuration:
1. Basic environment configuration
Alibaba Cloud, 8 × NVIDIA L20 at 48 GB each (about 45 GB actually usable per card).
NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4
python 3.10.13
torch 2.4.0
torchvision 0.19.0
transformers 4.45.0.dev0
vllm 0.6.1
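As a rough sanity check of this hardware against the launch flags, the weight memory per GPU can be estimated by hand. This is back-of-the-envelope arithmetic, not vLLM's own memory profiling, and it assumes about 72 billion parameters (taken from the model name), 2 bytes per parameter for `float16`, even sharding across the 8 tensor-parallel ranks, and roughly 45 GB usable per card, which is how we read the hardware line. Under those assumptions each GPU holds about 72e9 × 2 / 8 = 18 GB of weights against a budget of 45 × 0.9 = 40.5 GB under `--gpu-memory-utilization 0.9`, which is consistent with the log failing on CUDA initialization rather than on memory. The helper names are hypothetical.

```python
def weight_gb_per_gpu(num_params: float, bytes_per_param: int, tp_size: int) -> float:
    """Approximate weight memory per GPU (decimal GB) under tensor parallelism.

    Ignores activations, the KV cache, and any layers that are replicated
    rather than sharded, so the real figure will be somewhat higher.
    """
    return num_params * bytes_per_param / tp_size / 1e9


def usable_gb_per_gpu(total_gb: float, utilization: float) -> float:
    """Memory budget per GPU given a --gpu-memory-utilization fraction."""
    return total_gb * utilization


if __name__ == "__main__":
    weights = weight_gb_per_gpu(72e9, 2, 8)  # float16 = 2 bytes/param
    budget = usable_gb_per_gpu(45, 0.9)
    print(f"weights ~ {weights:.1f} GB, budget ~ {budget:.1f} GB, "
          f"left for KV cache etc. ~ {budget - weights:.1f} GB")
```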
2. Startup script
```bash
#!/bin/bash
model_path="/data/llms/qwen/Qwen2-VL-72B-Instruct"
model_server_name="Qwen2-VL-72B-Instruct"
max_model_len=20000
port=40116
local_ip="0.0.0.0"

# Switch the gcc version to 10.2
source /opt/rh/devtoolset-10/enable

# Activate the environment
source activate /data/anaconda3/envs/qwen-vl

log_file="${model_server_name}_${port}.log"

# Start the service
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m vllm.entrypoints.openai.api_server \
    --model "${model_path}" \
    --trust-remote-code \
    --served-model-name "${model_server_name}" \
    --enforce-eager \
    --dtype float16 \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 8 \
    --host "${local_ip}" \
    --max-model-len "${max_model_len}" \
    --disable-log-stats \
    --port "${port}" > "${log_file}" 2>&1 &
```
Could any expert help me solve this problem? Thank you very much.
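Since the error message asks for the `spawn` start method, one workaround worth trying is to make vLLM's multiprocessing executor spawn its workers instead of forking them. The sketch below assumes that this vLLM version reads the `VLLM_WORKER_MULTIPROC_METHOD` environment variable (check `vllm/envs.py` in the installed package to confirm before relying on it). The helpers `build_env` and `build_command` are hypothetical, and the flags are copied from the startup script.

```python
import os
import shlex
import sys


def build_env(base=None):
    """Copy the environment, pin GPUs 0-7 and request spawned workers.

    Assumption: the installed vLLM honors VLLM_WORKER_MULTIPROC_METHOD.
    """
    env = dict(os.environ if base is None else base)
    env["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"
    env["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
    return env


def build_command(model_path, served_name, port, max_model_len=20000):
    """Rebuild the api_server command line from the startup script."""
    return [
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",
        "--model", model_path,
        "--trust-remote-code",
        "--served-model-name", served_name,
        "--enforce-eager",
        "--dtype", "float16",
        "--gpu-memory-utilization", "0.9",
        "--tensor-parallel-size", "8",
        "--host", "0.0.0.0",
        "--max-model-len", str(max_model_len),
        "--disable-log-stats",
        "--port", str(port),
    ]


if __name__ == "__main__":
    cmd = build_command("/data/llms/qwen/Qwen2-VL-72B-Instruct",
                        "Qwen2-VL-72B-Instruct", 40116)
    # Print instead of launching; pass cmd and env=build_env() to
    # subprocess.Popen to actually start the server.
    print("VLLM_WORKER_MULTIPROC_METHOD=spawn", shlex.join(cmd))
```

In the shell script itself, the equivalent under the same assumption is adding `export VLLM_WORKER_MULTIPROC_METHOD=spawn` before the launch line.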