Running AngelSlim/DeekSeek-R1-0528_w4a8_fp8 with SGLang reports a quantization error #36

Description

@r4ining

Environment

OS: Ubuntu 22.04.5 LTS
GPU: H200 × 8

SGLang version: v0.4.10.post2
Model: AngelSlim/DeekSeek-R1-0528_w4a8_fp8 (Hugging Face)

Commands and errors

Command 1

SGLang launch command:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
MODEL_PATH=/model/DeekSeek-R1-0528_w4a8_fp8
PORT=30000
INFERENCE_TP_SIZE=8

python -m sglang.launch_server \
    --host 0.0.0.0 \
    --port ${PORT} \
    --model-path $MODEL_PATH \
    --tp $INFERENCE_TP_SIZE \
    --mem-fraction-static 0.9 \
    --trust-remote-code

Error output:

[2025-08-08 11:23:59] server_args=ServerArgs(model_path='/deepseek-v3/DeekSeek-R1-0528_w4a8_fp8', tokenizer_path='/deepseek-v3/DeekSeek-R1-0528_w4a8_fp8', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=True, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='0.0.0.0', port=30000, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.9, max_running_requests=None, max_queued_requests=9223372036854775807, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, device='cuda', tp_size=8, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=923360554, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, api_key=None, served_model_name='/deepseek-v3/DeekSeek-R1-0528_w4a8_fp8', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, moe_a2a_backend=None, enable_flashinfer_cutlass_moe=False, enable_flashinfer_trtllm_moe=False, enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, cuda_graph_max_bs=None, cuda_graph_bs=None, 
disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, enable_triton_kernel_moe=False, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, pdlb_url=None, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3, enable_ep_moe=False, enable_deepep_moe=False)
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/sgl-workspace/sglang/python/sglang/launch_server.py", line 14, in <module>
    launch_server(server_args)
  File "/sgl-workspace/sglang/python/sglang/srt/entrypoints/http_server.py", line 927, in launch_server
    tokenizer_manager, template_manager, scheduler_info = _launch_subprocesses(
  File "/sgl-workspace/sglang/python/sglang/srt/entrypoints/engine.py", line 788, in _launch_subprocesses
    tokenizer_manager = TokenizerManager(server_args, port_args)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tokenizer_manager.py", line 197, in __init__
    self.model_config = ModelConfig.from_server_args(server_args)
  File "/sgl-workspace/sglang/python/sglang/srt/configs/model_config.py", line 285, in from_server_args
    return ModelConfig(
  File "/sgl-workspace/sglang/python/sglang/srt/configs/model_config.py", line 271, in __init__
    self._verify_quantization()
  File "/sgl-workspace/sglang/python/sglang/srt/configs/model_config.py", line 468, in _verify_quantization
    raise ValueError(
ValueError: Unknown quantization method: w4a8_awq. Must be one of ['fp8', 'blockwise_int8', 'modelopt', 'modelopt_fp4', 'w8a8_int8', 'w8a8_fp8', 'moe_wna16', 'compressed-tensors', 'qoq', 'w4afp8', 'petit_nvfp4', 'aqlm', 'awq', 'deepspeedfp', 'tpu_int8', 'fbgemm_fp8', 'marlin', 'gguf', 'gptq_marlin_24', 'awq_marlin', 'bitsandbytes', 'qqq', 'experts_int8', 'gptq_marlin', 'gptq'].
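
Note for context: the traceback shows that SGLang's ModelConfig._verify_quantization reads quantization_config.quant_method from the checkpoint's config.json, and this checkpoint evidently declares w4a8_awq, which is not in the supported list above. A minimal Python sketch to confirm what the checkpoint declares; the path is the one from Command 1 and the supported list is copied verbatim from the ValueError, so nothing here is an official SGLang API:

import json

MODEL_PATH = "/model/DeekSeek-R1-0528_w4a8_fp8"  # local checkpoint dir from Command 1

with open(f"{MODEL_PATH}/config.json") as f:
    cfg = json.load(f)

# The field SGLang's ModelConfig._verify_quantization inspects
quant_method = cfg.get("quantization_config", {}).get("quant_method")
print("declared quant_method:", quant_method)  # per the error above: 'w4a8_awq'

# Supported methods copied verbatim from the ValueError above (v0.4.10.post2)
SUPPORTED = ['fp8', 'blockwise_int8', 'modelopt', 'modelopt_fp4', 'w8a8_int8',
             'w8a8_fp8', 'moe_wna16', 'compressed-tensors', 'qoq', 'w4afp8',
             'petit_nvfp4', 'aqlm', 'awq', 'deepspeedfp', 'tpu_int8',
             'fbgemm_fp8', 'marlin', 'gguf', 'gptq_marlin_24', 'awq_marlin',
             'bitsandbytes', 'qqq', 'experts_int8', 'gptq_marlin', 'gptq']
print("supported by this build:", quant_method in SUPPORTED)  # False for 'w4a8_awq'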

Command 2

python3 -m sglang.launch_server \
        --model-path /model/DeekSeek-R1-0528_w4a8_fp8 \
        --speculative-algorithm NEXTN \
        --speculative-num-steps 2 \
        --speculative-eagle-topk 1 \
        --speculative-num-draft-tokens 8 \
        --served-model-name deepseek-r1-0528 \
        --trust-remote-code \
        --tp 8 \
        --host 0.0.0.0 --port 30000 \
        --max-prefill-tokens 16384 \
        --max-running-requests 48 \
        --disable-radix-cache \
        --mem-fraction-static 0.9 \
        --chunked-prefill-size 8192 \
        --schedule-conservativeness 0.01 \
        --cuda-graph-max-bs=160 \
        --enable-deepep-moe \
        --attention-backend fa3 \
        --quantization w4afp8 \
        --stream-output

Error output (I also tried all of the other values accepted by sglang's --quantization flag; each fails with a similar error):

NOTE: --enable-deepep-moe is deprecated. Please set `--moe-a2a-backend` to 'deepep' instead.
DeepEP MoE is enabled. The expert parallel size is adjusted to be the same as the tensor parallel size[8].
Overlap scheduler is disabled because of using eagle speculative decoding.
speculative_num_draft_tokens is adjusted to speculative_num_steps + 1 when speculative_eagle_topk == 1
[2025-08-08 11:33:16] server_args=ServerArgs(model_path='/deepseek-v3/DeekSeek-R1-0528_w4a8_fp8/', tokenizer_path='/deepseek-v3/DeekSeek-R1-0528_w4a8_fp8/', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=True, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='0.0.0.0', port=30000, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='auto', quantization='w4afp8', quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.9, max_running_requests=48, max_queued_requests=9223372036854775807, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=0.01, cpu_offload_gb=0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, device='cuda', tp_size=8, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=True, random_seed=794962622, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, api_key=None, served_model_name='deepseek-r1-0528', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='fa3', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm='EAGLE', speculative_draft_model_path='/deepseek-v3/DeekSeek-R1-0528_w4a8_fp8/', speculative_num_steps=2, speculative_eagle_topk=1, speculative_num_draft_tokens=3, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=8, moe_a2a_backend='deepep', enable_flashinfer_cutlass_moe=False, enable_flashinfer_trtllm_moe=False, enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=True, cuda_graph_max_bs=160, 
cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=True, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, enable_triton_kernel_moe=False, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, pdlb_url=None, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3, enable_ep_moe=False, enable_deepep_moe=True)
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/sgl-workspace/sglang/python/sglang/launch_server.py", line 14, in <module>
    launch_server(server_args)
  File "/sgl-workspace/sglang/python/sglang/srt/entrypoints/http_server.py", line 927, in launch_server
    tokenizer_manager, template_manager, scheduler_info = _launch_subprocesses(
  File "/sgl-workspace/sglang/python/sglang/srt/entrypoints/engine.py", line 788, in _launch_subprocesses
    tokenizer_manager = TokenizerManager(server_args, port_args)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tokenizer_manager.py", line 197, in __init__
    self.model_config = ModelConfig.from_server_args(server_args)
  File "/sgl-workspace/sglang/python/sglang/srt/configs/model_config.py", line 285, in from_server_args
    return ModelConfig(
  File "/sgl-workspace/sglang/python/sglang/srt/configs/model_config.py", line 271, in __init__
    self._verify_quantization()
  File "/sgl-workspace/sglang/python/sglang/srt/configs/model_config.py", line 459, in _verify_quantization
    raise ValueError(
ValueError: Quantization method specified in the model config (w4a8_awq) does not match the quantization method specified in the `quantization` argument (w4afp8).
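
Note for context: when --quantization is set explicitly, SGLang still compares it against the method declared in the checkpoint's config.json and raises on any mismatch, so no --quantization value can succeed while config.json says w4a8_awq. Below is a hedged, untested sketch of one possible workaround using the --json-model-override-args flag (it appears in the server_args dump above) to make the two values agree; overriding only silences the mismatch check, and whether the w4afp8 kernels can actually load these weights is a separate question:

import json
import shlex

# Hypothetical override payload: rewrite quant_method to a value this SGLang
# build recognizes. Untested; it does not guarantee the weights load correctly.
override = {"quantization_config": {"quant_method": "w4afp8"}}
print("--json-model-override-args", shlex.quote(json.dumps(override)))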

What is the cause of this?
