[Misc] vllm CLI flags should be ordered for better user readability #10017

Merged
merged 1 commit into vllm-project:main from chaunceyjiang:order_flag
Nov 5, 2024

Conversation

chaunceyjiang
Contributor

FIX #10016

vllm CLI flags should be ordered for better user readability
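
Purely for illustration, here is a minimal sketch of how alphabetical ordering of argparse flags can be achieved with a custom help formatter. The class name and wiring below are assumptions made for this sketch, not necessarily the exact change in this PR:

```python
import argparse


class SortedHelpFormatter(argparse.HelpFormatter):
    """Help formatter that lists options alphabetically instead of in
    the order they were registered."""

    def add_arguments(self, actions):
        # Sort each group's actions by their option strings, e.g. '--api-key'.
        # Positional arguments have no option strings and therefore sort first;
        # '-h, --help' sorts last because '-h' compares after '--...'.
        super().add_arguments(sorted(actions, key=lambda a: a.option_strings))


parser = argparse.ArgumentParser(prog="vllm serve",
                                 formatter_class=SortedHelpFormatter)
parser.add_argument("--port", type=int, help="port number")
parser.add_argument("--host", help="host name")
parser.add_argument("--api-key", help="API key required by the server")
parser.print_help()  # options render as --api-key, --host, --port, then -h/--help
```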


github-actions bot commented Nov 5, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@chaunceyjiang
Contributor Author

Test

# vllm serve --help

usage: vllm serve <model_tag> [options]

positional arguments:
  model_tag             The model tag to serve

options:
  --allow-credentials   allow credentials
  --allowed-headers ALLOWED_HEADERS
                        allowed headers
  --allowed-local-media-path ALLOWED_LOCAL_MEDIA_PATH
                        Allowing API requests to read local images or videos from directories specified by the server file system. This is a security risk. Should only be enabled in
                        trusted environments.
  --allowed-methods ALLOWED_METHODS
                        allowed methods
  --allowed-origins ALLOWED_ORIGINS
                        allowed origins
  --api-key API_KEY     If provided, the server will require this key to be presented in the header.
  --block-size {8,16,32}
                        Token block size for contiguous chunks of tokens. This is ignored on neuron devices and set to max-model-len
  --chat-template CHAT_TEMPLATE
                        The file path to the chat template, or the template in single-line form for the specified model
  --chat-template-text-format {string,openai}
                        The format to render text content within a chat template. "string" will keep the content field as a string whereas "openai" will parse content in the current
                        OpenAI format.
  --code-revision CODE_REVISION
                        The specific revision to use for the model code on Hugging Face Hub. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default
                        version.
  --collect-detailed-traces COLLECT_DETAILED_TRACES
                        Valid choices are model,worker,all. It makes sense to set this only if --otlp-traces-endpoint is set. If set, it will collect detailed traces for the specified
                        modules. This involves use of possibly costly and/or blocking operations and hence might have a performance impact.
  --config CONFIG       Read CLI options from a config file. Must be a YAML with the following options:
                        https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#command-line-arguments-for-the-server
  --config-format {auto,hf,mistral}
                        The format of the model config to load. * "auto" will try to load the config in hf format if available else it will try to load in mistral format
  --cpu-offload-gb CPU_OFFLOAD_GB
                        The space in GiB to offload to CPU, per GPU. Default is 0, which means no offloading. Intuitively, this argument can be seen as a virtual way to increase the
                        GPU memory size. For example, if you have one 24 GB GPU and set this to 10, virtually you can think of it as a 34 GB GPU. Then you can load a 13B model with
                        BF16 weight, which requires at least 26GB GPU memory. Note that this requires fast CPU-GPU interconnect, as part of the model is loaded from CPU memory to GPU
                        memory on the fly in each model forward pass.
  --device {auto,cuda,neuron,cpu,openvino,tpu,xpu}
                        Device type for vLLM execution.
  --disable-async-output-proc
                        Disable async output processing. This may result in lower performance.
  --disable-custom-all-reduce
                        See ParallelConfig.
  --disable-fastapi-docs
                        Disable FastAPI's OpenAPI schema, Swagger UI, and ReDoc endpoint
  --disable-frontend-multiprocessing
                        If specified, will run the OpenAI frontend server in the same process as the model serving engine.
  --disable-log-requests
                        Disable logging requests.
  --disable-log-stats   Disable logging statistics.
  --disable-logprobs-during-spec-decoding [DISABLE_LOGPROBS_DURING_SPEC_DECODING]
                        If set to True, token log probabilities are not returned during speculative decoding. If set to False, log probabilities are returned according to the settings
                        in SamplingParams. If not specified, it defaults to True. Disabling log probabilities during speculative decoding reduces latency by skipping logprob
                        calculation in proposal sampling, target sampling, and after accepted tokens are determined.
  --disable-sliding-window
                        Disables sliding window, capping to sliding window size
  --distributed-executor-backend {ray,mp}
                        Backend to use for distributed serving. When more than 1 GPU is used, will be automatically set to "ray" if installed or "mp" (multiprocessing) otherwise.
  --download-dir DOWNLOAD_DIR
                        Directory to download and load the weights; defaults to the default Hugging Face cache directory.
  --dtype {auto,half,float16,bfloat16,float,float32}
                        Data type for model weights and activations. * "auto" will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models. * "half" for FP16.
                        Recommended for AWQ quantization. * "float16" is the same as "half". * "bfloat16" for a balance between precision and range. * "float" is shorthand for FP32
                        precision. * "float32" for FP32 precision.
  --enable-auto-tool-choice
                        Enable auto tool choice for supported models. Use --tool-call-parser to specify which parser to use.
  --enable-chunked-prefill [ENABLE_CHUNKED_PREFILL]
                        If set, the prefill requests can be chunked based on the max_num_batched_tokens.
  --enable-lora         If True, enable handling of LoRA adapters.
  --enable-prefix-caching
                        Enables automatic prefix caching.
  --enable-prompt-adapter
                        If True, enable handling of PromptAdapters.
  --enforce-eager       Always use eager-mode PyTorch. If False, will use eager mode and CUDA graph in hybrid for maximal performance and flexibility.
  --fully-sharded-loras
                        By default, only half of the LoRA computation is sharded with tensor parallelism. Enabling this will use the fully sharded layers. At high sequence length, max
                        rank or tensor parallel size, this is likely faster.
  --gpu-memory-utilization GPU_MEMORY_UTILIZATION
                        The fraction of GPU memory to be used for the model executor, which can range from 0 to 1. For example, a value of 0.5 would imply 50% GPU memory utilization.
                        If unspecified, will use the default value of 0.9. This is a global gpu memory utilization limit, for example if 50% of the gpu memory is already used before
                        vLLM starts and --gpu-memory-utilization is set to 0.9, then only 40% of the gpu memory will be allocated to the model executor.
  --guided-decoding-backend {outlines,lm-format-enforcer}
                        Which engine will be used for guided decoding (JSON schema / regex etc.) by default. Currently supports https://github.com/outlines-dev/outlines and
                        https://github.com/noamgat/lm-format-enforcer. Can be overridden per request via the guided_decoding_backend parameter.
  --host HOST           host name
  --ignore-patterns IGNORE_PATTERNS
                        The pattern(s) to ignore when loading the model. Defaults to 'original/**/*' to avoid repeated loading of llama's checkpoints.
  --kv-cache-dtype {auto,fp8,fp8_e5m2,fp8_e4m3}
                        Data type for kv cache storage. If "auto", will use model data type. CUDA 11.8+ supports fp8 (=fp8_e4m3) and fp8_e5m2. ROCm (AMD GPU) supports fp8 (=fp8_e4m3)
  --limit-mm-per-prompt LIMIT_MM_PER_PROMPT
                        For each multimodal plugin, limit how many input instances to allow for each prompt. Expects a comma-separated list of items, e.g.: `image=16,video=2` allows a
                        maximum of 16 images and 2 videos per prompt. Defaults to 1 for each modality.
  --load-format {auto,pt,safetensors,npcache,dummy,tensorizer,sharded_state,gguf,bitsandbytes,mistral}
                        The format of the model weights to load. * "auto" will try to load the weights in the safetensors format and fall back to the pytorch bin format if safetensors
                        format is not available. * "pt" will load the weights in the pytorch bin format. * "safetensors" will load the weights in the safetensors format. * "npcache"
                        will load the weights in pytorch format and store a numpy cache to speed up the loading. * "dummy" will initialize the weights with random values, which is
                        mainly for profiling. * "tensorizer" will load the weights using tensorizer from CoreWeave. See the Tensorize vLLM Model script in the Examples section for more
                        information. * "bitsandbytes" will load the weights using bitsandbytes quantization.
  --long-lora-scaling-factors LONG_LORA_SCALING_FACTORS
                        Specify multiple scaling factors (which can be different from base model scaling factor - see eg. Long LoRA) to allow for multiple LoRA adapters trained with
                        those scaling factors to be used at the same time. If not specified, only adapters trained with the base model scaling factor are allowed.
  --lora-dtype {auto,float16,bfloat16,float32}
                        Data type for LoRA. If auto, will default to base model dtype.
  --lora-extra-vocab-size LORA_EXTRA_VOCAB_SIZE
                        Maximum size of extra vocabulary that can be present in a LoRA adapter (added to the base model vocabulary).
  --lora-modules LORA_MODULES [LORA_MODULES ...]
                        LoRA module configurations in either 'name=path' format or JSON format. Example (old format): 'name=path' Example (new format): '{"name": "name", "local_path":
                        "path", "base_model_name": "id"}'
  --max-cpu-loras MAX_CPU_LORAS
                        Maximum number of LoRAs to store in CPU memory. Must be >= max_num_seqs. Defaults to max_num_seqs.
  --max-log-len MAX_LOG_LEN
                        Max number of prompt characters or prompt ID numbers being printed in log. Default: Unlimited
  --max-logprobs MAX_LOGPROBS
                        Max number of log probs to return when logprobs is specified in SamplingParams.
  --max-lora-rank MAX_LORA_RANK
                        Max LoRA rank.
  --max-loras MAX_LORAS
                        Max number of LoRAs in a single batch.
  --max-model-len MAX_MODEL_LEN
                        Model context length. If unspecified, will be automatically derived from the model config.
  --max-num-batched-tokens MAX_NUM_BATCHED_TOKENS
                        Maximum number of batched tokens per iteration.
  --max-num-seqs MAX_NUM_SEQS
                        Maximum number of sequences per iteration.
  --max-parallel-loading-workers MAX_PARALLEL_LOADING_WORKERS
                        Load model sequentially in multiple batches, to avoid RAM OOM when using tensor parallel and large models.
  --max-prompt-adapter-token MAX_PROMPT_ADAPTER_TOKEN
                        Max number of PromptAdapters tokens
  --max-prompt-adapters MAX_PROMPT_ADAPTERS
                        Max number of PromptAdapters in a batch.
  --max-seq-len-to-capture MAX_SEQ_LEN_TO_CAPTURE
                        Maximum sequence length covered by CUDA graphs. When a sequence has context length larger than this, we fall back to eager mode. Additionally for encoder-
                        decoder models, if the sequence length of the encoder input is larger than this, we fall back to the eager mode.
  --middleware MIDDLEWARE
                        Additional ASGI middleware to apply to the app. We accept multiple --middleware arguments. The value should be an import path. If a function is provided, vLLM
                        will add it to the server using @app.middleware('http'). If a class is provided, vLLM will add it to the server using app.add_middleware().
  --mm-processor-kwargs MM_PROCESSOR_KWARGS
                        Overrides for the multimodal input mapping/processing, e.g., image processor. For example: {"num_crops": 4}.
  --model MODEL         Name or path of the huggingface model to use.
  --model-loader-extra-config MODEL_LOADER_EXTRA_CONFIG
                        Extra config for model loader. This will be passed to the model loader corresponding to the chosen load_format. This should be a JSON string that will be parsed
                        into a dictionary.
  --multi-step-stream-outputs [MULTI_STEP_STREAM_OUTPUTS]
                        If False, then multi-step will stream outputs at the end of all steps
  --ngram-prompt-lookup-max NGRAM_PROMPT_LOOKUP_MAX
                        Max size of window for ngram prompt lookup in speculative decoding.
  --ngram-prompt-lookup-min NGRAM_PROMPT_LOOKUP_MIN
                        Min size of window for ngram prompt lookup in speculative decoding.
  --no-pooling-norm     Used to determine whether to normalize the pooled data in the embedding model.
  --no-pooling-softmax  Used to determine whether to softmax the pooled data in the embedding model.
  --num-gpu-blocks-override NUM_GPU_BLOCKS_OVERRIDE
                        If specified, ignore GPU profiling result and use this number of GPU blocks. Used for testing preemption.
  --num-lookahead-slots NUM_LOOKAHEAD_SLOTS
                        Experimental scheduling config necessary for speculative decoding. This will be replaced by speculative config in the future; it is present to enable
                        correctness tests until then.
  --num-scheduler-steps NUM_SCHEDULER_STEPS
                        Maximum number of forward steps per scheduler call.
  --num-speculative-tokens NUM_SPECULATIVE_TOKENS
                        The number of speculative tokens to sample from the draft model in speculative decoding.
  --otlp-traces-endpoint OTLP_TRACES_ENDPOINT
                        Target URL to which OpenTelemetry traces will be sent.
  --override-neuron-config OVERRIDE_NEURON_CONFIG
                        Override or set neuron device configuration, e.g. {"cast_logits_dtype": "bfloat16"}.
  --pipeline-parallel-size PIPELINE_PARALLEL_SIZE, -pp PIPELINE_PARALLEL_SIZE
                        Number of pipeline stages.
  --pooling-norm        Used to determine whether to normalize the pooled data in the embedding model.
  --pooling-returned-token-ids POOLING_RETURNED_TOKEN_IDS [POOLING_RETURNED_TOKEN_IDS ...]
                        pooling-returned-token-ids represents a list of indices for the vocabulary dimensions to be extracted, such as the token IDs of good_token and bad_token in the
                        math-shepherd-mistral-7b-prm model.
  --pooling-softmax     Used to determine whether to softmax the pooled data in the embedding model.
  --pooling-step-tag-id POOLING_STEP_TAG_ID
                        When pooling-step-tag-id is not -1, it indicates that the score corresponding to the step-tag-ids in the generated sentence should be returned. Otherwise, it
                        returns the scores for all tokens.
  --pooling-type {LAST,ALL,CLS,STEP}
                        Used to configure the pooling method in the embedding model.
  --port PORT           port number
  --preemption-mode PREEMPTION_MODE
                        If 'recompute', the engine performs preemption by recomputing; If 'swap', the engine performs preemption by block swapping.
  --prompt-adapters PROMPT_ADAPTERS [PROMPT_ADAPTERS ...]
                        Prompt adapter configurations in the format name=path. Multiple adapters can be specified.
  --qlora-adapter-name-or-path QLORA_ADAPTER_NAME_OR_PATH
                        Name or path of the QLoRA adapter.
  --quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,fbgemm_fp8,modelopt,marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,compressed-tensors,bitsandbytes,qqq,experts_int8,neuron_quant,ipex,None}, -q {aqlm,awq,deepspeedfp,tpu_int8,fp8,fbgemm_fp8,modelopt,marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,compressed-tensors,bitsandbytes,qqq,experts_int8,neuron_quant,ipex,None}
                        Method used to quantize the weights. If None, we first check the `quantization_config` attribute in the model config file. If that is None, we assume the model
                        weights are not quantized and use `dtype` to determine the data type of the weights.
  --quantization-param-path QUANTIZATION_PARAM_PATH
                        Path to the JSON file containing the KV cache scaling factors. This should generally be supplied, when KV cache dtype is FP8. Otherwise, KV cache scaling
                        factors default to 1.0, which may cause accuracy issues. FP8_E5M2 (without scaling) is only supported on CUDA version greater than 11.8. On ROCm (AMD GPU),
                        FP8_E4M3 is instead supported for common inference criteria.
  --ray-workers-use-nsight
                        If specified, use nsight to profile Ray workers.
  --response-role RESPONSE_ROLE
                        The role name to return if `request.add_generation_prompt=true`.
  --return-tokens-as-token-ids
                        When --max-logprobs is specified, represents single tokens as strings of the form 'token_id:{token_id}' so that tokens that are not JSON-encodable can be
                        identified.
  --revision REVISION   The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.
  --root-path ROOT_PATH
                        FastAPI root_path when app is behind a path based routing proxy
  --rope-scaling ROPE_SCALING
                        RoPE scaling configuration in JSON format. For example, {"rope_type":"dynamic","factor":2.0}
  --rope-theta ROPE_THETA
                        RoPE theta. Use with `rope_scaling`. In some cases, changing the RoPE theta improves the performance of the scaled model.
  --scheduler-delay-factor SCHEDULER_DELAY_FACTOR
                        Apply a delay (of delay factor multiplied by previous prompt latency) before scheduling next prompt.
  --scheduling-policy {fcfs,priority}
                        The scheduling policy to use. "fcfs" (first come first served, i.e. requests are handled in order of arrival; default) or "priority" (requests are handled based
                        on given priority (lower value means earlier handling) and time of arrival deciding any ties).
  --seed SEED           Random seed for operations.
  --served-model-name SERVED_MODEL_NAME [SERVED_MODEL_NAME ...]
                        The model name(s) used in the API. If multiple names are provided, the server will respond to any of the provided names. The model name in the model field of a
                        response will be the first name in this list. If not specified, the model name will be the same as the `--model` argument. Note that these name(s) will also be
                        used in the `model_name` tag content of Prometheus metrics; if multiple names are provided, the metrics tag will take the first one.
  --skip-tokenizer-init
                        Skip initialization of tokenizer and detokenizer
  --spec-decoding-acceptance-method {rejection_sampler,typical_acceptance_sampler}
                        Specify the acceptance method to use during draft token verification in speculative decoding. Two types of acceptance routines are supported: 1)
                        RejectionSampler which does not allow changing the acceptance rate of draft tokens, 2) TypicalAcceptanceSampler which is configurable, allowing for a higher
                        acceptance rate at the cost of lower quality, and vice versa.
  --speculative-disable-by-batch-size SPECULATIVE_DISABLE_BY_BATCH_SIZE
                        Disable speculative decoding for new incoming requests if the number of enqueue requests is larger than this value.
  --speculative-disable-mqa-scorer
                        If set to True, the MQA scorer will be disabled in speculative decoding and will fall back to batch expansion.
  --speculative-draft-tensor-parallel-size SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE, -spec-draft-tp SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE
                        Number of tensor parallel replicas for the draft model in speculative decoding.
  --speculative-max-model-len SPECULATIVE_MAX_MODEL_LEN
                        The maximum sequence length supported by the draft model. Sequences over this length will skip speculation.
  --speculative-model SPECULATIVE_MODEL
                        The name of the draft model to be used in speculative decoding.
  --speculative-model-quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,fbgemm_fp8,modelopt,marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,compressed-tensors,bitsandbytes,qqq,experts_int8,neuron_quant,ipex,None}
                        Method used to quantize the weights of speculative model. If None, we first check the `quantization_config` attribute in the model config file. If that is None,
                        we assume the model weights are not quantized and use `dtype` to determine the data type of the weights.
  --ssl-ca-certs SSL_CA_CERTS
                        The CA certificates file
  --ssl-cert-reqs SSL_CERT_REQS
                        Whether a client certificate is required (see the stdlib ssl module).
  --ssl-certfile SSL_CERTFILE
                        The file path to the SSL cert file
  --ssl-keyfile SSL_KEYFILE
                        The file path to the SSL key file
  --swap-space SWAP_SPACE
                        CPU swap space size (GiB) per GPU.
  --task {auto,generate,embedding}
                        The task to use the model for. Each vLLM instance only supports one task, even if the same model can be used for multiple tasks. When the model only supports
                        one task, "auto" can be used to select it; otherwise, you must specify explicitly which task to use.
  --tensor-parallel-size TENSOR_PARALLEL_SIZE, -tp TENSOR_PARALLEL_SIZE
                        Number of tensor parallel replicas.
  --tokenizer TOKENIZER
                        Name or path of the huggingface tokenizer to use. If unspecified, model name or path will be used.
  --tokenizer-mode {auto,slow,mistral}
                        The tokenizer mode. * "auto" will use the fast tokenizer if available. * "slow" will always use the slow tokenizer. * "mistral" will always use the
                        `mistral_common` tokenizer.
  --tokenizer-pool-extra-config TOKENIZER_POOL_EXTRA_CONFIG
                        Extra config for tokenizer pool. This should be a JSON string that will be parsed into a dictionary. Ignored if tokenizer_pool_size is 0.
  --tokenizer-pool-size TOKENIZER_POOL_SIZE
                        Size of tokenizer pool to use for asynchronous tokenization. If 0, will use synchronous tokenization.
  --tokenizer-pool-type TOKENIZER_POOL_TYPE
                        Type of tokenizer pool to use for asynchronous tokenization. Ignored if tokenizer_pool_size is 0.
  --tokenizer-revision TOKENIZER_REVISION
                        Revision of the huggingface tokenizer to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.
  --tool-call-parser {granite-20b-fc,hermes,internlm,jamba,llama3_json,mistral} or name registered in --tool-parser-plugin
                        Select the tool call parser depending on the model that you're using. This is used to parse the model-generated tool call into OpenAI API format. Required for
                        --enable-auto-tool-choice.
  --tool-parser-plugin TOOL_PARSER_PLUGIN
                        Specify the tool parser plugin used to parse model-generated tool calls into OpenAI API format; the names registered in this plugin can be used in
                        --tool-call-parser.
  --trust-remote-code   Trust remote code from huggingface.
  --typical-acceptance-sampler-posterior-alpha TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_ALPHA
                        A scaling factor for the entropy-based threshold for token acceptance in the TypicalAcceptanceSampler. Typically defaults to sqrt of --typical-acceptance-
                        sampler-posterior-threshold i.e. 0.3
  --typical-acceptance-sampler-posterior-threshold TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_THRESHOLD
                        Set the lower bound threshold for the posterior probability of a token to be accepted. This threshold is used by the TypicalAcceptanceSampler to make sampling
                        decisions during speculative decoding. Defaults to 0.09
  --use-v2-block-manager
                        [DEPRECATED] block manager v1 has been removed and SelfAttnBlockSpaceManager (i.e. block manager v2) is now the default. Setting this flag to True or False has
                        no effect on vLLM behavior.
  --uvicorn-log-level {debug,info,warning,error,critical,trace}
                        log level for uvicorn
  --worker-use-ray      Deprecated, use --distributed-executor-backend=ray.
  -h, --help            show this help message and exit
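
As a quick sanity check of the ordering above, something like the following could be used. This is a hypothetical snippet, not part of this PR, and it assumes `vllm` is installed and on PATH:

```python
import re
import subprocess

# Capture the help text of `vllm serve` and pull out the long option names,
# which appear at two-space indentation (e.g. "  --api-key API_KEY ...").
help_text = subprocess.run(["vllm", "serve", "--help"],
                           capture_output=True, text=True).stdout
options = re.findall(r"^  (--[a-z0-9-]+)", help_text, flags=re.MULTILINE)

# After this change the list should already be in alphabetical order.
assert options == sorted(options), "CLI flags are not alphabetically ordered"
```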

@chaunceyjiang chaunceyjiang marked this pull request as ready for review November 5, 2024 02:59
Collaborator

@Isotr0py Isotr0py left a comment


LGTM!

@Isotr0py Isotr0py enabled auto-merge (squash) November 5, 2024 03:05
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 5, 2024
@Isotr0py
Collaborator

Isotr0py commented Nov 5, 2024

BTW, perhaps we could group some feature-specific CLI flags, like spec decoding and embeddings, in a future PR.
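
A minimal sketch of what such grouping could look like with argparse argument groups. The group names and the handful of flags shown are illustrative assumptions, not an actual proposal from this PR:

```python
import argparse

parser = argparse.ArgumentParser(prog="vllm serve")

# Feature-specific flags collected under named groups so that `--help`
# renders them together under their own headings instead of one flat list.
spec = parser.add_argument_group("speculative decoding")
spec.add_argument("--speculative-model")
spec.add_argument("--num-speculative-tokens", type=int)

pooling = parser.add_argument_group("embedding / pooling")
pooling.add_argument("--pooling-type", choices=["LAST", "ALL", "CLS", "STEP"])
pooling.add_argument("--pooling-norm", action="store_true")

parser.print_help()  # each group is rendered under its own heading
```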

@chaunceyjiang
Contributor Author

/retest

@chaunceyjiang
Contributor Author

@Isotr0py Hi, there is currently a failing test case, but from its log it appears to be unrelated to this PR. How should I retry these failed test cases?

@Isotr0py
Collaborator

Isotr0py commented Nov 5, 2024

You can merge from the main branch to re-run the test CI. I will ask for a force merge for this PR tonight if it keeps failing on unrelated tests :)

auto-merge was automatically disabled November 5, 2024 07:54

Head branch was pushed to by a user without write access

@chaunceyjiang
Contributor Author

> You can merge from the main branch to re-run the test CI. I will ask for a force merge for this PR tonight if it keeps failing on unrelated tests :)

ok

@chaunceyjiang
Contributor Author

@Isotr0py PTAL. All tests have passed.

@Isotr0py Isotr0py merged commit 93dee88 into vllm-project:main Nov 5, 2024
55 checks passed
@chaunceyjiang chaunceyjiang deleted the order_flag branch November 5, 2024 13:17
JC1DA pushed a commit to JC1DA/vllm that referenced this pull request Nov 11, 2024
sumitd2 pushed a commit to sumitd2/vllm that referenced this pull request Nov 14, 2024
KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024
mfournioux pushed a commit to mfournioux/vllm that referenced this pull request Nov 20, 2024
tlrmchlsmth pushed a commit to neuralmagic/vllm that referenced this pull request Nov 23, 2024
sleepwalker2017 pushed a commit to sleepwalker2017/vllm that referenced this pull request Dec 13, 2024
Labels
ready ONLY add when PR is ready to merge/full CI is needed
Development

Successfully merging this pull request may close these issues.

[Feature]: vllm CLI flags should be ordered for better user readability.
2 participants