This doc serves as a developer-level guide and provides a brief code walkthrough of SGLang's backend, tracing the path of how requests are processed, as shown in the following figure.
Specifically, requests flow through the following process to get responses:
- The user launches the Server, initializing the FastAPI app, TokenizerManager, DetokenizerManager, and Scheduler, each running with its infinite event loop.
- The user sends `/v1/chat/completions` requests to the FastAPI Server, which routes them to TokenizerManager via the `v1_chat_completions` endpoint (a minimal client sketch follows this list).
- The `v1_chat_completions` function converts the incoming requests into a `ChatCompletionRequest`, transforms them into a `GenerateReqInput`, and calls TokenizerManager's `generate_request` method.
- TokenizerManager tokenizes the requests and forwards them to the Scheduler as Python objects (`pyobj`) while calling `_wait_one_response`.
- The Scheduler loops its infinite `event_loop_normal` to handle the requests:
  - The Scheduler receives the requests via `recv_requests`, processes them through `process_input_requests`, handles the generation logic with `handle_generate_request`, and adds them to the `waiting_queue`.
  - From the `waiting_queue`, the Scheduler uses `get_next_batch_to_run` to create a `ScheduleBatch` for the upcoming requests.
  - The Scheduler executes the `run_batch` function, converting the `ScheduleBatch` into a `ModelWorkerBatch`.
  - The Scheduler calls TpModelWorker's `forward_batch_generation`, awaiting the `logits_output` and `next_token_ids`.
  - TpModelWorker initializes a `ForwardBatch`, forwards it to ModelRunner, and waits for the `logits_output`.
  - The ModelRunner processes the `ForwardBatch`, classifies it, and calls `forward_extend` to execute the model's forward pass.
  - The model, accelerated by the `AttentionBackend`, generates logits, which are returned to ModelRunner and subsequently to TpModelWorker.
  - TpModelWorker receives the `logits_output` from ModelRunner, calls ModelRunner's `sample` method to generate `next_token_ids`, and sends them back to the Scheduler.
  - The Scheduler processes the batch results using `process_batch_result` and checks the completion status via `check_finished`.
  - If requests are completed, the `process_batch_result` function adds them to the cache using `tree_cache.cache_finished_req(req)` and sends their outputs to the Scheduler's `stream_output`.
  - In `stream_output`, the Scheduler processes the outputs, wraps them into `BatchTokenIDOut`, and sends them to the DetokenizerManager.
- The DetokenizerManager, running its own event loop, receives `BatchTokenIDOut`, processes it, and sends `BatchStrOut` back to TokenizerManager.
- The TokenizerManager, within its event loop, receives the results, processes them via `handle_loop`, updates the internal state, and yields the response to the server.
- The FastAPI Server packages the response and sends it back to the user.
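To make the entry point concrete, here is a minimal client-side sketch of the `/v1/chat/completions` call mentioned above, sent to a locally running SGLang server. The port and the `"model"` value are illustrative assumptions, not fixed by SGLang.

```python
import requests

# Assumes an SGLang server is already running locally on port 30000
# and was launched with a chat model; both values are illustrative.
url = "http://127.0.0.1:30000/v1/chat/completions"
payload = {
    "model": "default",  # the server serves whatever model it was launched with
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "temperature": 0,
    "max_tokens": 64,
}

resp = requests.post(url, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```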
All the discussions are based on release v0.4.0. We sincerely appreciate Chenyang Zhao, Wenxuan Tan, Simon Veitner, Shuai Shi, Shizhe Diao, Shending Hu, Xiaoyu Zhang, and agiping for their contributions to this document.
Note that this document is under construction and these parts will be included in the future:
- Radix Cache Management with Attention Backend.
- `get_next_batch_to_run`: how to fetch and write KV cache for requests in each batch, `get_model_worker_batch`, `write_req_to_token_pool_trition`.
- CUDA Graphs for Attention Backend.
- Overlapping scheduling.
SGLang features an SRT (SGLang Runtime) Server for serving online HTTP requests and an Engine for offline model execution. Key functions, `launch_server` and `launch_engine`, are in `server.py`. The `launch_engine` function initializes core SRT Server components:
- Set up configs (logger, server args, CUDA/NCCL env, inter-process ports) and download the model and tokenizer.
- Run Scheduler processes: each Scheduler runs a TpModelWorker for prefill and decode, manages the Radix Cache, and handles TokenizerManager requests in an infinite event loop. If `dp_size > 1`, run `run_data_parallel_controller_process`; otherwise, initialize a Scheduler for each `tp_rank`.
- Run TokenizerManager and DetokenizerManager as subprocesses: the former tokenizes data for the Scheduler, and the latter detokenizes Scheduler outputs for the server frontend. For multi-node inference (e.g., Llama 3.1 405B on 2 nodes with 8 H100s each), TokenizerManager and DetokenizerManager only run on the first node.
- Apply chat templates (if specified) and wait for Scheduler processes to signal readiness while collecting their configuration.
Note that in version 0.4.0, the DataParallelController is used for round-robin scheduling of requests across multiple data parallel replicas. We will change this to SGLang Router in the future.
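For reference, the Engine path mentioned above can be used directly from Python. The sketch below assumes the offline Engine API documented for recent releases and an illustrative model path; treat it as a sketch rather than the canonical launch path. The Engine drives the same Scheduler/TpModelWorker stack described in this doc, just without the HTTP layer.

```python
import sglang as sgl

# Illustrative model path; any model supported by SGLang works here.
llm = sgl.Engine(model_path="Qwen/Qwen2-1.5B-Instruct")

prompts = ["The capital of France is"]
sampling_params = {"temperature": 0, "max_new_tokens": 16}

# outputs is typically a list with one entry per prompt, each carrying a "text" field.
outputs = llm.generate(prompts, sampling_params)
print(outputs[0]["text"])

llm.shutdown()
```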
The Server employs a FastAPI app to define API endpoints, forwarding `/v1/chat/completions` requests to the TokenizerManager via `v1_chat_completions`.
- Parse JSON from `raw_request` into a `ChatCompletionRequest`, convert it to a `GenerateReqInput`, and configure `sampling_params` using `v1_chat_generate_request`.
- Call the TokenizerManager's `generate_request` and handle streaming or non-streaming responses based on the `stream` parameter.
- For streaming, process the `generate_request` output incrementally with `generate_stream_resp`; for non-streaming, await the result and convert it to a `ChatCompletionResponse` via `v1_chat_generate_response`.
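The request conversion in the first bullet can be pictured with the following heavily simplified sketch. The dataclasses and fields here are illustrative stand-ins, not SGLang's actual `ChatCompletionRequest` / `GenerateReqInput` definitions, and the prompt building skips the real chat template.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ChatRequestSketch:
    """Illustrative stand-in for the parsed /v1/chat/completions body."""
    messages: List[Dict[str, str]]
    temperature: float = 1.0
    max_tokens: int = 128
    stream: bool = False


@dataclass
class GenerateInputSketch:
    """Illustrative stand-in for the internal generation request."""
    text: str
    sampling_params: Dict = field(default_factory=dict)
    stream: bool = False


def to_generate_input(req: ChatRequestSketch) -> GenerateInputSketch:
    # Real code applies the model's chat template; a plain join is used here
    # only to keep the sketch self-contained.
    prompt = "\n".join(f"{m['role']}: {m['content']}" for m in req.messages)
    return GenerateInputSketch(
        text=prompt,
        sampling_params={
            "temperature": req.temperature,
            "max_new_tokens": req.max_tokens,
        },
        stream=req.stream,
    )


req = ChatRequestSketch(messages=[{"role": "user", "content": "Hi!"}], temperature=0)
print(to_generate_input(req))
```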
TokenizerManager is initialized by `launch_server` in the Server's main process to tokenize requests.
- Set up ZeroMQ for inter-process communication, including sockets for the DetokenizerManager and Scheduler.
- Configure `server_args`, enable `metrics`, and initialize `model_config`, `tokenizer`, and placeholders for multi-modal image processors.
- Create an event loop if not already initialized.
- Pause processing if model weights are updating via `update_weights_from_disk` or `update_weights_from_distributed`.
- Validate request compatibility with the model's `is_generation` setting.
- Normalize requests using `normalize_batch_and_arguments` to manage batching, parallel sampling, and default parameters.
- Process single requests with `_tokenize_one_request`, send them to the Scheduler, and wait for responses from `_wait_one_response`.
- Process batch requests with `_handle_batch_request`: tokenize inputs, manage parallel sampling, interact with the Scheduler, and yield responses in both streaming and non-streaming modes.
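The hand-off between TokenizerManager, Scheduler, and DetokenizerManager happens over ZeroMQ, with requests sent as pickled Python objects (the `pyobj` mentioned earlier). The snippet below is a minimal, self-contained sketch of that pattern using pyzmq; the socket address and message contents are illustrative, not SGLang's actual wiring.

```python
import zmq

ctx = zmq.Context()

# "Scheduler" side: pull requests from an upstream process.
pull = ctx.socket(zmq.PULL)
pull.bind("tcp://127.0.0.1:5757")  # illustrative port

# "TokenizerManager" side: push tokenized requests downstream.
push = ctx.socket(zmq.PUSH)
push.connect("tcp://127.0.0.1:5757")

# Any picklable Python object can be shipped; SGLang sends its own request dataclasses.
push.send_pyobj(
    {"rid": "req-0", "input_ids": [1, 2, 3], "sampling_params": {"temperature": 0}}
)

print(pull.recv_pyobj())

push.close()
pull.close()
ctx.term()
```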
Scheduler runs as the Server's subprocess, initialized via `run_scheduler_process`, and executes its infinite event loop with `event_loop_normal` or `event_loop_overlap`.
- Set up ZeroMQ for communication with TokenizerManager and response handling.
- Configure `server_args`, `port_args`, `model_config`, `sessions`, and initialize TpModelWorker or TpModelWorkerClient based on overlap scheduling.
- Initialize the tokenizer and processor, manage caching using ChunkCache or RadixCache, and configure SchedulePolicy.
- Set up chunked prefill parameters and the GrammarBackend for request processing.
The Scheduler continuously executes its event loop, alternating between `process_input_requests`, `get_next_batch_to_run`, `run_batch`, and `process_batch_result`.
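The structure of that loop can be sketched as below. This is a simplified, hypothetical skeleton mirroring the four phases named above, not the actual `event_loop_normal` implementation; the stub methods exist only to make the control flow runnable.

```python
class SchedulerSketch:
    """Toy skeleton mirroring the Scheduler's normal event loop."""

    def __init__(self):
        self.waiting_queue = []
        self.steps = 0

    # --- stubs standing in for the real phases --------------------------
    def recv_requests(self):
        return [{"rid": f"req-{self.steps}"}] if self.steps == 0 else []

    def process_input_requests(self, reqs):
        self.waiting_queue.extend(reqs)

    def get_next_batch_to_run(self):
        return [self.waiting_queue.pop(0)] if self.waiting_queue else None

    def run_batch(self, batch):
        return [{"rid": r["rid"], "next_token_id": 42} for r in batch]

    def process_batch_result(self, batch, result):
        print("finished:", result)

    # --- the loop itself -------------------------------------------------
    def event_loop_normal(self, max_steps=3):
        while self.steps < max_steps:          # the real loop runs forever
            reqs = self.recv_requests()
            self.process_input_requests(reqs)
            batch = self.get_next_batch_to_run()
            if batch is not None:
                result = self.run_batch(batch)
                self.process_batch_result(batch, result)
            self.steps += 1


SchedulerSketch().event_loop_normal()
```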
In `process_input_requests`, the Scheduler iterates over incoming requests, identifies their types, and dispatches them to the appropriate handlers.
In `get_next_batch_to_run`:
- Merge `last_batch` with `running_batch` if applicable and prioritize prefill batches with `get_new_batch_prefill`.
- If there is no prefill batch, update `running_batch` for a decode batch by filtering requests, managing memory, and adjusting decoding parameters.
In `run_batch`:
- For generation models, use TpModelWorker's `forward_batch_generation` for token prediction or `forward_batch_idle` for idle tasks, returning results to `event_loop_normal`.
- For embedding or reward models, execute `forward_batch_embedding` and return embeddings.
In serving engines, LLM inference is usually broken into Prefill and Decode stages because of their different compute characteristics. You can check this post from HuggingFace regarding the concept of Prefill and Decode. In SGLang, extend mode is used instead of prefill mode most of the time. Prefill initializes the KV-Cache for new requests, typically using Paged KV-Cache. Extend updates the existing KV-Cache incrementally, often leveraging Ragged Tensors for efficiency, making it ideal for long sequences or multi-turn tasks.
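To illustrate the difference, the toy sketch below contrasts how many tokens need fresh KV-cache computation in prefill versus extend for a multi-turn conversation. It is purely conceptual; prefix matching in SGLang is actually done on the radix tree over cached token IDs.

```python
def common_prefix_len(a, b):
    """Length of the shared token-ID prefix between two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Turn 1: the whole prompt is new, so prefill computes KV for every token.
turn1 = [101, 5, 9, 23, 7]
cached = turn1                      # the KV cache now covers these tokens

# Turn 2: the new prompt repeats the old conversation plus a new user message.
turn2 = turn1 + [31, 42, 8]

prefix = common_prefix_len(cached, turn2)
extend_tokens = turn2[prefix:]      # only these need a forward pass in extend mode

print(f"prefill would compute KV for {len(turn2)} tokens")
print(f"extend only computes KV for {len(extend_tokens)} tokens: {extend_tokens}")
```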
After `run_batch`, the Scheduler processes batch results in `event_loop_normal`:
- Decode mode processes outputs, updates request states, handles token and probability data, manages memory, and logs statistics.
- Extend mode handles prefill results, processes input tokens, and prepares for further decoding or embedding.
- Finished requests are cached via `cache_finished_req` and streamed to the DetokenizerManager. Unfinished requests are updated and looped back into `get_next_batch_to_run` for further processing until completion.
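A request is considered finished when it hits a stop condition; the sketch below shows the typical checks (EOS token, stop strings, `max_new_tokens`) in simplified form. It is an illustrative approximation, not SGLang's actual `check_finished`.

```python
from typing import List, Optional


def check_finished_sketch(
    output_ids: List[int],
    output_text: str,
    eos_token_id: int,
    stop_strs: List[str],
    max_new_tokens: int,
) -> Optional[str]:
    """Return a finish reason, or None if the request should keep decoding."""
    if len(output_ids) >= max_new_tokens:
        return "length"
    if output_ids and output_ids[-1] == eos_token_id:
        return "stop"  # the model emitted EOS
    if any(s in output_text for s in stop_strs):
        return "stop"  # a user-provided stop sequence matched
    return None


print(check_finished_sketch([11, 7, 2], "Paris.", eos_token_id=2,
                            stop_strs=["\n\n"], max_new_tokens=64))
```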
TpModelWorker manages ModelRunner's forward and sampling for batches of requests scheduled by the Scheduler.
- Initializes tokenizer, model configuration and ModelRunner.
- Configures device settings and memory pool limits.
In `forward_batch_generation`:
- Create a `ForwardBatch`, compute logits, and sample next tokens using ModelRunner's `forward` and `sample`.
- Return `logits_output` and `next_token_ids` to the Scheduler for `process_batch_result`.
In `forward_batch_embedding`:
- Create a `ForwardBatch`, and get the `logits_output` and `embeddings` from ModelRunner's `forward`.
- Bypass the sampling process in `forward_batch_generation` and directly return the `embeddings` to the Scheduler.
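The `sample` step at the end of the generation path boils down to turning logits into next-token IDs. Below is a minimal PyTorch sketch of greedy and temperature sampling over a batch of logits; SGLang's actual sampler also handles top-k/top-p, penalties, and per-request parameters.

```python
import torch


def sample_next_tokens(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """logits: [batch_size, vocab_size] for the last position of each request."""
    if temperature == 0:
        # Greedy decoding: pick the most likely token.
        return torch.argmax(logits, dim=-1)
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)


batch_logits = torch.randn(4, 32000)           # 4 requests, toy vocab of 32k
next_token_ids = sample_next_tokens(batch_logits, temperature=0.7)
print(next_token_ids)                          # shape [4], one token id per request
```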
ModelRunner initializes the attention backend and manages the loaded model to perform forward passes for generation and embedding tasks. At initialization, it sets up the distributed environment, loads the model, applies tensor parallelism, and sets up the memory pool and attention backend.
The `forward` function determines the appropriate forward mode for processing batches based on the `forward_mode`:
- `forward_decode`: Initializes forward metadata and calls the Model's `forward` with input IDs and positions.
- `forward_extend`: Initializes forward metadata and calls the Model's `forward` for generation or embedding tasks.
- `forward_idle`: Manages the forward pass when the forward mode is idle.
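The dispatch can be pictured as a small switch on the batch's forward mode, sketched below with an illustrative enum; the real `ForwardMode` and dispatch live inside SGLang's model executor code and carry more cases.

```python
from enum import Enum, auto


class ForwardModeSketch(Enum):
    EXTEND = auto()   # prefill / incremental prefix extension
    DECODE = auto()   # one new token per running request
    IDLE = auto()     # nothing to compute for this step


def forward_sketch(mode: ForwardModeSketch) -> str:
    # Stand-ins for forward_extend / forward_decode / forward_idle.
    if mode == ForwardModeSketch.DECODE:
        return "forward_decode: attend over cached KV, compute logits for 1 new token"
    if mode == ForwardModeSketch.EXTEND:
        return "forward_extend: compute KV for the new tokens, then logits"
    return "forward_idle: skip model compute"


for mode in ForwardModeSketch:
    print(forward_sketch(mode))
```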
ModelRunner's `self.model` is an instance of the model class. All supported models can be found in `python/sglang/srt/models`. We take Qwen2ForCausalLM as an example. `Qwen2ForCausalLM` is structured as follows:
- `model`: weights used for the forward pass.
- `embed_tokens`: maps `input_ids` into `embeddings`.
- `lm_head`: projects the hidden states back to the vocabulary space.
- `logits_processor`: manipulates `logits` to perform tasks such as sampling and normalization.
- `pooler`: pooling mechanism for extracting embeddings or computing rewards.
The `forward` function in `Qwen2ForCausalLM` processes input sequences to produce logits for next-token prediction of chat completion requests, or embeddings for reward/embedding requests:
- Converts `input_ids` to embeddings using `embed_tokens`, then sequentially forwards the embeddings through the Qwen2DecoderLayer layers.
- Returns embeddings via `pooler` if `get_embedding` is True; otherwise, uses `logits_processor` to compute `logits` and returns them.
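The shape flow of that forward pass can be seen in the toy PyTorch module below: token IDs → embeddings → a stack of decoder-layer stand-ins → vocabulary logits. The layer sizes are arbitrary and the "decoder layers" are plain linear blocks, so this is only a structural sketch of the embed_tokens → layers → lm_head → logits_processor path, not Qwen2's architecture.

```python
import torch
import torch.nn as nn


class ToyCausalLM(nn.Module):
    def __init__(self, vocab_size=1000, hidden_size=64, num_layers=2):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        # Stand-ins for Qwen2DecoderLayer; real layers use attention + MLP + norms.
        self.layers = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.GELU())
            for _ in range(num_layers)
        )
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.embed_tokens(input_ids)      # [batch, seq, hidden]
        for layer in self.layers:
            hidden = layer(hidden)
        return self.lm_head(hidden)                # [batch, seq, vocab] logits


model = ToyCausalLM()
input_ids = torch.randint(0, 1000, (2, 5))         # 2 requests, 5 tokens each
logits = model(input_ids)
print(logits.shape)                                # torch.Size([2, 5, 1000])
next_ids = logits[:, -1, :].argmax(dim=-1)         # roughly: logits_processor + greedy sampling
print(next_ids)
```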
The most important acceleration comes from the interaction between `forward_batch` and the AttentionBackend. SGLang supports several attention backends which accelerate model forward passes and KV cache reuse. We take FlashInferBackend as an example.
- Configures wrappers for sliding window and cross-attention scenarios.
- Allocates necessary buffers for workspace and key-value indices.
- Prepares forward metadata for efficient attention computation.
- Integrates CUDA graph support for optimized execution paths.
- Decode Mode: Updates indices for decoding using `indices_updater_decode` and sets `forward_metadata` to use `decode_wrappers`.
- Extend Mode: Determines if ragged forward is needed based on token count and wrapper count, then updates indices with `indices_updater_prefill`.
- Metadata Assignment: Sets `forward_metadata` with flags for ragged use and prefix extension.
- Selects the appropriate wrapper and decides between ragged or paged attention for `forward_extend`, or picks the decode wrapper for `forward_decode`.
- Computes attention, manages key-value caching, and returns the reshaped output.
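Paged attention over the KV cache relies on an indirection table from each request's logical token positions to slots in a shared token pool. The sketch below shows that indexing idea in plain PyTorch; the names (`req_to_token`, `k_pool`) echo SGLang's memory pool, but the layout is simplified and purely illustrative.

```python
import torch

num_slots, num_heads, head_dim = 16, 2, 4

# Shared pools holding K/V for every cached token of every request.
k_pool = torch.randn(num_slots, num_heads, head_dim)
v_pool = torch.randn(num_slots, num_heads, head_dim)

# Per-request mapping from logical token position -> slot index in the pool.
# Request 0 has 3 cached tokens scattered across the pool.
req_to_token = {0: torch.tensor([7, 2, 11])}


def gather_kv(rid: int):
    """Gather this request's K/V from the pool in logical order."""
    slots = req_to_token[rid]
    return k_pool[slots], v_pool[slots]        # each [seq_len, num_heads, head_dim]


k, v = gather_kv(0)
print(k.shape)  # torch.Size([3, 2, 4])

# During decode, the backend attends the new query against exactly these gathered
# K/V entries, so requests never need contiguous cache storage.
```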
DetokenizerManager is initialized by `launch_server` as a subprocess of the Server to detokenize requests. It sets up communication sockets and the tokenizer, and manages decoding states with a `LimitedCapacityDict`.
In `event_loop` and `trim_eos`:
- Receives processed requests from the Scheduler, forwarding `BatchEmbeddingOut` directly or processing `BatchTokenIDOut` for detokenization.
- Splits token IDs into `read_ids` and `surr_ids`. Converts token IDs to text using `batch_decode`. Updates `DecodeStatus` with new offsets and decoded text.
- Trims outputs at stop sequences, combines decoded text with metadata into `BatchStrOut`, and sends it to TokenizerManager.
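Incremental detokenization needs a few surrounding tokens as context, because decoding a lone token ID can produce broken byte sequences or wrong spacing. A simplified sketch of the `surr_ids` / `read_ids` idea using a Hugging Face tokenizer is shown below; the offsets and the choice of tokenizer are illustrative, and the real code also handles incomplete UTF-8 and a sliding surrogate window.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # any HF tokenizer works for the sketch

output_ids = tok.encode("Hello world, how are you today?")
surr_offset, read_offset = 0, 0
emitted = ""

for step in range(1, len(output_ids) + 1):
    read_ids = output_ids[surr_offset:step]          # surrounding context + new tokens
    surr_ids = output_ids[surr_offset:read_offset]   # previously emitted context only

    surr_text = tok.decode(surr_ids)
    read_text = tok.decode(read_ids)
    new_text = read_text[len(surr_text):]            # only the newly finalized text

    emitted += new_text
    read_offset = step
    # A real implementation also advances surr_offset, keeping a small window.

print(emitted)
```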
- DetokenizerManager sends `BatchStrOut` to TokenizerManager via ZeroMQ.
- TokenizerManager updates request states and prepares decoded text for FastAPI.
- Finally, in FastAPI, for streaming, use an async generator and `StreamingResponse` to send the response to the user.
- For non-streaming, collect and send the complete response using `ORJSONResponse` and return it to the user.
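For completeness, here is a minimal FastAPI sketch of both response paths: an async generator wrapped in `StreamingResponse` for streaming, and `ORJSONResponse` for the non-streaming case. The endpoint paths and payloads are illustrative, not SGLang's actual handlers.

```python
import asyncio

from fastapi import FastAPI
from fastapi.responses import ORJSONResponse, StreamingResponse

app = FastAPI()


@app.get("/demo/stream")
async def demo_stream():
    async def event_generator():
        # Stand-in for chunks yielded by TokenizerManager's generate_request.
        for chunk in ["Hel", "lo ", "world"]:
            yield f"data: {chunk}\n\n"
            await asyncio.sleep(0)          # let the event loop breathe
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_generator(), media_type="text/event-stream")


@app.get("/demo/full")
async def demo_full():
    # Stand-in for the fully collected, non-streaming result.
    return ORJSONResponse({"choices": [{"message": {"content": "Hello world"}}]})
```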