1. Introduction

FlashInfer kernels guarantee the same output when given the same batch of inputs multiple times, but they do not guarantee that each request generates the same output across different batching choices (#696); we call the latter property strong reproducibility. When the number of requests in a batch is small, split-k may be used to increase SM utilization, which changes the attention aggregation order and can lead to output variance.

This variance also exists in prefill and decode attention (#703). We have several sets of kernel implementations with slight differences (loop order, CTA tile size on the KV dimension, use of CUDA cores vs. tensor cores, whether `sm_scale` is pre-applied to the query, etc.), all of which can affect the final results. The same variability applies to GEMM kernels.
Strong reproducibility is important for numerous workloads, such as using LLMs for compression (e.g., ts_zip). A concrete example is demonstrated in this notebook, where the rank of each token in the logits predicted at its position is stored. The rank array alone suffices to reproduce the entire input: during autoregressive decoding, the i-th output token is recovered by selecting the rank[i]-th highest-scoring choice. The average number of bits in the Huffman coding of the rank array is significantly smaller than that of the input token index array.
However, this approach requires strong reproducibility between the prefill (compress) and decode (decompress) stages. If users want to scale things up, we need to guarantee that the kernel generates the exact same output for each request, regardless of batching. Even a tiny error will pollute the entire decompression result. In the notebook, the data type is set to `float32` because neither `float16` nor `bfloat16` can guarantee consistent outputs between prefill and decode, and even `fp32` might not ensure reproducibility for long contexts without additional measures.
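To make the compress/decompress scheme concrete, here is a minimal sketch of rank-based compression. `toy_logits` is a deterministic stand-in for a real LLM (the name and vocabulary size are invented for illustration); the round trip is lossless precisely because the "model" is bit-exact between the compress and decompress stages, which is what strong reproducibility guarantees for a real model.

```python
# Toy sketch of rank-based compression. `toy_logits` is a hypothetical
# deterministic pseudo-model, not a real LLM or any FlashInfer API.
import hashlib

VOCAB = 256  # assumed toy vocabulary size


def toy_logits(context: tuple) -> list:
    """Deterministic pseudo-logits over VOCAB tokens for a given context."""
    seed = hashlib.sha256(repr(context).encode()).digest()
    return [seed[i % len(seed)] for i in range(VOCAB)]


def ranking(logits: list) -> list:
    """Token ids sorted by descending logit; ties broken by token id so the
    order is total and reproducible."""
    return sorted(range(VOCAB), key=lambda t: (-logits[t], t))


def compress(tokens: list) -> list:
    """Store, for each position, the rank of the true token in the model's
    prediction at that position."""
    ranks = []
    for i, tok in enumerate(tokens):
        order = ranking(toy_logits(tuple(tokens[:i])))
        ranks.append(order.index(tok))
    return ranks


def decompress(ranks: list) -> list:
    """Rebuild the input by picking the rank[i]-th choice autoregressively."""
    tokens = []
    for r in ranks:
        order = ranking(toy_logits(tuple(tokens)))
        tokens.append(order[r])
    return tokens
```

If any logit differs between the two stages, one rank resolves to the wrong token, and every subsequent context (and thus every subsequent token) is corrupted — which is why even tiny numerical deviations break the pipeline.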
2. Proposal – ReproSpec
Strong reproducibility needs to be co-designed with LLM serving engines, not only at the kernel level. On the kernel side, we can expose an abstraction called ReproSpec, which records all necessary information for the kernel:
- CTA tile size
- Abstraction of the aggregation order
  - Sequential / hierarchical / tensor cores / all-reduce / etc.
  - Number of elements being accumulated
- Input and output precision
- ...
- Hardware information
- Hash of the kernel template
- Other relevant reproducibility metadata
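The fields above could be captured in a small immutable record; the sketch below is hypothetical (field names and types are illustrative, not FlashInfer API), assuming an equality-and-hash-based key is what the serving engine needs.

```python
# Hypothetical sketch of a ReproSpec record; names are illustrative only.
from dataclasses import dataclass


@dataclass(frozen=True)
class ReproSpec:
    cta_tile_kv: int       # CTA tile size on the KV dimension
    aggregation: str       # "sequential" | "hierarchical" | "allreduce" | ...
    accum_elems: int       # number of elements accumulated per step
    input_dtype: str       # e.g. "float16"
    output_dtype: str      # e.g. "float32"
    hardware: str          # e.g. "sm_80"
    kernel_hash: str       # hash of the instantiated kernel template
```

`frozen=True` makes instances hashable, so a spec can directly key a batching table or a kernel-dispatch map.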
When strong reproducibility is required, the serving engine must only batch together requests that share the same ReproSpec, and FlashInfer should dispatch to the kernel implementation corresponding to that ReproSpec.
When no strong reproducibility constraint is imposed, the LLM serving engine can freely batch requests with different configurations and select kernels based on performance heuristics.
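The two batching policies above can be sketched as follows; this is a minimal illustration (the function name and request representation are invented), using any hashable value as the per-request spec key.

```python
# Sketch: batch only requests whose spec keys match when reproducibility
# is required; otherwise batch freely. Keys stand in for real ReproSpecs.
from collections import defaultdict


def form_batches(requests, reproducible: bool):
    """requests: iterable of (request_id, spec_key) pairs."""
    requests = list(requests)
    if not reproducible:
        # High-performance mode: one batch, any kernel heuristics apply.
        return [[rid for rid, _ in requests]]
    # Strong reproducibility mode: group by spec key so every batch is
    # served by exactly one kernel configuration.
    groups = defaultdict(list)
    for rid, key in requests:
        groups[key].append(rid)
    return list(groups.values())
```

In practice a scheduler would also weigh batch sizes and latency targets, but the invariant is the same: a batch never mixes two ReproSpecs in reproducible mode.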
3. Rationale
Eliminating Numeric Variations:
Small changes in the internal accumulation order or the choice of split-k can lead to different floating-point rounding paths, causing slight deviations in output. By explicitly specifying and enforcing a ReproSpec, we ensure deterministic dataflows.
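The root cause is that floating-point addition is not associative, so any change in reduction order (for example, split-k combining partial sums in a different order than a sequential loop) can change the rounded result:

```python
# Floating-point addition is not associative: two aggregation orders of
# the same three values round differently.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c   # one accumulation order
right = a + (b + c)  # a different accumulation order
print(left == right)  # False
```

At fp16/bf16 precision and over millions of accumulated elements, these per-step differences are larger and compound further, which is why the aggregation order must be part of the spec.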
Ensuring Compression Fidelity:
The LLM-as-compressor use case is severely impacted by even minuscule differences in logits. A single wrong token rank can derail the entire decompression pipeline. ReproSpec safeguards this workflow by guaranteeing identical numerical outcomes across prefill and decode stages.
Scalability with Confidence:
As context lengths and model sizes grow, so do floating-point accumulation errors. ReproSpec provides a standard contract for predictable, reproducible numerical behavior in large-scale LLM deployments.
4. Implementation Considerations
Metadata Tracking:
The serving engine and kernel must coordinate to store and retrieve the ReproSpec metadata. This may involve hashing relevant kernel templates, CTA tiling parameters, etc.
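One way to derive a stable kernel-template hash is to digest the instantiated template source together with a canonical serialization of its tiling parameters; the function below is an illustrative sketch (its name and parameter layout are assumptions, not an existing API).

```python
# Sketch: a stable hash over a kernel template and its configuration.
import hashlib
import json


def kernel_hash(template_src: str, params: dict) -> str:
    """Digest the template source plus a canonical (key-sorted) dump of the
    tiling parameters, so logically identical configs hash identically."""
    blob = template_src.encode() + json.dumps(params, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()
```

Sorting keys makes the hash independent of parameter insertion order, which matters if specs are assembled in different code paths on the engine and kernel sides.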
Kernel Dispatch:
The runtime should maintain a mapping from each ReproSpec to a specific kernel instantiated with the configuration recorded in that spec.
Performance vs. Reproducibility:
The serving engine should batch requests depending on whether strong reproducibility is required or not.
- High-Performance Mode: no strong reproducibility constraints, allowing flexible batching and kernel choices.
- Strong Reproducibility Mode: all requests sharing the same ReproSpec are batched together, ensuring identical numerical results.
yzh119 changed the title from "[RFC]: Introducing ReproSpec for Strong Reproducibility in FlashInfer" to "[RFC]: Introducing ReproSpec for Strong Reproducibility in LLM Inference" on Jan 11, 2025.
cc @Ying1123 @xiezhq-hermann