Running AngelSlim/DeekSeek-R1-0528_w4a8_fp8 with SGLang reports a quantization error #36

Description

@r4ining

Environment

OS: Ubuntu 22.04.5 LTS
GPU: H200 × 8

SGLang version: v0.4.10.post2
Model: AngelSlim/DeekSeek-R1-0528_w4a8_fp8 (Hugging Face)

Commands and errors

Command 1

SGLang launch command:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
MODEL_PATH=/model/DeekSeek-R1-0528_w4a8_fp8
PORT=30000
INFERENCE_TP_SIZE=8

python -m sglang.launch_server \
    --host 0.0.0.0 \
    --port ${PORT} \
    --model-path $MODEL_PATH \
    --tp $INFERENCE_TP_SIZE \
    --mem-fraction-static 0.9 \
    --trust-remote-code

Error output:

[2025-08-08 11:23:59] server_args=ServerArgs(model_path='/deepseek-v3/DeekSeek-R1-0528_w4a8_fp8', tokenizer_path='/deepseek-v3/DeekSeek-R1-0528_w4a8_fp8', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=True, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='0.0.0.0', port=30000, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.9, max_running_requests=None, max_queued_requests=9223372036854775807, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, device='cuda', tp_size=8, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=923360554, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, api_key=None, served_model_name='/deepseek-v3/DeekSeek-R1-0528_w4a8_fp8', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, moe_a2a_backend=None, enable_flashinfer_cutlass_moe=False, enable_flashinfer_trtllm_moe=False, enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, cuda_graph_max_bs=None, cuda_graph_bs=None, 
disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, enable_triton_kernel_moe=False, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, pdlb_url=None, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3, enable_ep_moe=False, enable_deepep_moe=False)
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/sgl-workspace/sglang/python/sglang/launch_server.py", line 14, in <module>
    launch_server(server_args)
  File "/sgl-workspace/sglang/python/sglang/srt/entrypoints/http_server.py", line 927, in launch_server
    tokenizer_manager, template_manager, scheduler_info = _launch_subprocesses(
  File "/sgl-workspace/sglang/python/sglang/srt/entrypoints/engine.py", line 788, in _launch_subprocesses
    tokenizer_manager = TokenizerManager(server_args, port_args)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tokenizer_manager.py", line 197, in __init__
    self.model_config = ModelConfig.from_server_args(server_args)
  File "/sgl-workspace/sglang/python/sglang/srt/configs/model_config.py", line 285, in from_server_args
    return ModelConfig(
  File "/sgl-workspace/sglang/python/sglang/srt/configs/model_config.py", line 271, in __init__
    self._verify_quantization()
  File "/sgl-workspace/sglang/python/sglang/srt/configs/model_config.py", line 468, in _verify_quantization
    raise ValueError(
ValueError: Unknown quantization method: w4a8_awq. Must be one of ['fp8', 'blockwise_int8', 'modelopt', 'modelopt_fp4', 'w8a8_int8', 'w8a8_fp8', 'moe_wna16', 'compressed-tensors', 'qoq', 'w4afp8', 'petit_nvfp4', 'aqlm', 'awq', 'deepspeedfp', 'tpu_int8', 'fbgemm_fp8', 'marlin', 'gguf', 'gptq_marlin_24', 'awq_marlin', 'bitsandbytes', 'qqq', 'experts_int8', 'gptq_marlin', 'gptq'].
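
Note for context: the traceback shows that SGLang's ModelConfig._verify_quantization reads quantization_config.quant_method from the checkpoint's config.json, and this checkpoint evidently declares w4a8_awq, which is not in the supported list above. A minimal Python sketch to confirm what the checkpoint declares; the path is the one from Command 1 and the supported list is copied verbatim from the ValueError, so nothing here is an official SGLang API:

import json

MODEL_PATH = "/model/DeekSeek-R1-0528_w4a8_fp8"  # local checkpoint dir from Command 1

with open(f"{MODEL_PATH}/config.json") as f:
    cfg = json.load(f)

# The field SGLang's ModelConfig._verify_quantization inspects
quant_method = cfg.get("quantization_config", {}).get("quant_method")
print("declared quant_method:", quant_method)  # per the error above: 'w4a8_awq'

# Supported methods copied verbatim from the ValueError above (v0.4.10.post2)
SUPPORTED = ['fp8', 'blockwise_int8', 'modelopt', 'modelopt_fp4', 'w8a8_int8',
             'w8a8_fp8', 'moe_wna16', 'compressed-tensors', 'qoq', 'w4afp8',
             'petit_nvfp4', 'aqlm', 'awq', 'deepspeedfp', 'tpu_int8',
             'fbgemm_fp8', 'marlin', 'gguf', 'gptq_marlin_24', 'awq_marlin',
             'bitsandbytes', 'qqq', 'experts_int8', 'gptq_marlin', 'gptq']
print("supported by this build:", quant_method in SUPPORTED)  # False for 'w4a8_awq'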

Command 2

python3 -m sglang.launch_server \
        --model-path /model/DeekSeek-R1-0528_w4a8_fp8 \
        --speculative-algorithm NEXTN \
        --speculative-num-steps 2 \
        --speculative-eagle-topk 1 \
        --speculative-num-draft-tokens 8 \
        --served-model-name deepseek-r1-0528 \
        --trust-remote-code \
        --tp 8 \
        --host 0.0.0.0 --port 30000 \
        --max-prefill-tokens 16384 \
        --max-running-requests 48 \
        --disable-radix-cache \
        --mem-fraction-static 0.9 \
        --chunked-prefill-size 8192 \
        --schedule-conservativeness 0.01 \
        --cuda-graph-max-bs=160 \
        --enable-deepep-moe \
        --attention-backend fa3 \
        --quantization w4afp8 \
        --stream-output

Error output (I also tried all of the other values accepted by sglang's --quantization flag; each fails with a similar error):

NOTE: --enable-deepep-moe is deprecated. Please set `--moe-a2a-backend` to 'deepep' instead.
DeepEP MoE is enabled. The expert parallel size is adjusted to be the same as the tensor parallel size[8].
Overlap scheduler is disabled because of using eagle speculative decoding.
speculative_num_draft_tokens is adjusted to speculative_num_steps + 1 when speculative_eagle_topk == 1
[2025-08-08 11:33:16] server_args=ServerArgs(model_path='/deepseek-v3/DeekSeek-R1-0528_w4a8_fp8/', tokenizer_path='/deepseek-v3/DeekSeek-R1-0528_w4a8_fp8/', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=True, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='0.0.0.0', port=30000, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='auto', quantization='w4afp8', quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.9, max_running_requests=48, max_queued_requests=9223372036854775807, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=0.01, cpu_offload_gb=0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, device='cuda', tp_size=8, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=True, random_seed=794962622, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, api_key=None, served_model_name='deepseek-r1-0528', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='fa3', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm='EAGLE', speculative_draft_model_path='/deepseek-v3/DeekSeek-R1-0528_w4a8_fp8/', speculative_num_steps=2, speculative_eagle_topk=1, speculative_num_draft_tokens=3, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=8, moe_a2a_backend='deepep', enable_flashinfer_cutlass_moe=False, enable_flashinfer_trtllm_moe=False, enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=True, cuda_graph_max_bs=160, 
cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=True, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, enable_triton_kernel_moe=False, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, pdlb_url=None, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3, enable_ep_moe=False, enable_deepep_moe=True)
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/sgl-workspace/sglang/python/sglang/launch_server.py", line 14, in <module>
    launch_server(server_args)
  File "/sgl-workspace/sglang/python/sglang/srt/entrypoints/http_server.py", line 927, in launch_server
    tokenizer_manager, template_manager, scheduler_info = _launch_subprocesses(
  File "/sgl-workspace/sglang/python/sglang/srt/entrypoints/engine.py", line 788, in _launch_subprocesses
    tokenizer_manager = TokenizerManager(server_args, port_args)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tokenizer_manager.py", line 197, in __init__
    self.model_config = ModelConfig.from_server_args(server_args)
  File "/sgl-workspace/sglang/python/sglang/srt/configs/model_config.py", line 285, in from_server_args
    return ModelConfig(
  File "/sgl-workspace/sglang/python/sglang/srt/configs/model_config.py", line 271, in __init__
    self._verify_quantization()
  File "/sgl-workspace/sglang/python/sglang/srt/configs/model_config.py", line 459, in _verify_quantization
    raise ValueError(
ValueError: Quantization method specified in the model config (w4a8_awq) does not match the quantization method specified in the `quantization` argument (w4afp8).
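
Note for context: when --quantization is set explicitly, SGLang still compares it against the method declared in the checkpoint's config.json and raises on any mismatch, so no --quantization value can succeed while config.json says w4a8_awq. Below is a hedged, untested sketch of one possible workaround using the --json-model-override-args flag (it appears in the server_args dump above) to make the two values agree; overriding only silences the mismatch check, and whether the w4afp8 kernels can actually load these weights is a separate question:

import json
import shlex

# Hypothetical override payload: rewrite quant_method to a value this SGLang
# build recognizes. Untested; it does not guarantee the weights load correctly.
override = {"quantization_config": {"quant_method": "w4afp8"}}
print("--json-model-override-args", shlex.quote(json.dumps(override)))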

What is the cause of this?
