
[RFC]: Introducing ReproSpec for Strong Reproducibility in LLM Inference #733

yzh119 opened this issue Jan 11, 2025 · 0 comments

1. Introduction

FlashInfer kernels guarantee the same output when given the same batch of inputs multiple times, but they do not guarantee that each request generates the same output across different batching choices (#696); we call the latter property strong reproducibility. When the number of requests in a batch is small, split-k may be used to increase SM utilization, which changes the attention aggregation order and can lead to output variance.

This variance also exists between prefill and decode attention (#703). We maintain several sets of kernel implementations with slight differences (loop order, CTA tile size on the KV dimension, use of CUDA cores vs. tensor cores, whether sm_scale is pre-applied to the query, etc.), all of which can affect the final results. The same variability applies to GEMM kernels.

Strong reproducibility is important for numerous workloads, such as using LLMs for compression (e.g., ts_zip). A concrete example is demonstrated in this notebook, where we store the rank of each token in the logits produced at its preceding position. The rank array alone is enough to reproduce the entire input: during autoregressive decoding, the i-th output token is recovered by selecting the rank[i]-th highest-scoring choice. The average number of bits in the Huffman encoding of the rank array is significantly smaller than that of the input token-index array.

However, this approach requires strong reproducibility between the prefill (compress) and decode (decompress) stages. If users want to scale things up, we need to guarantee that the kernel generates the exact same output for each request, regardless of batching. Even a tiny error will pollute the entire decompression result. In the notebook, the data type is set to float32 because neither float16 nor bfloat16 can guarantee consistent outputs for prefill and decode, and even fp32 might not ensure reproducibility for long contexts without additional measures.
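To make the rank-based scheme concrete, here is a minimal, standalone sketch of the encode/decode round trip on a toy logits vector. The function names and the toy data are illustrative, not from the notebook; the key point is that decompression recovers the token only if the decode-side logits are bit-identical to the prefill-side logits, since any perturbation near a tie flips ranks.

```python
# Toy sketch of rank-based compression (hypothetical names and data;
# a real setup would use logits from actual prefill/decode passes).
import numpy as np

def token_rank(logits: np.ndarray, token: int) -> int:
    # Rank of `token` among all vocabulary entries, 0 = highest logit.
    order = np.argsort(-logits, kind="stable")
    return int(np.where(order == token)[0][0])

def token_from_rank(logits: np.ndarray, rank: int) -> int:
    # Inverse operation: pick the rank-th highest-scoring token.
    order = np.argsort(-logits, kind="stable")
    return int(order[rank])

logits = np.array([0.1, 2.3, -0.5, 1.7], dtype=np.float32)
tok = 3
r = token_rank(logits, tok)                  # encode: store the rank
assert token_from_rank(logits, r) == tok     # decode: round-trips only if
                                             # logits are bit-identical
```

If the decode-side kernel produces even slightly different logits (e.g., 1.7 vs. 1.700001 against a competing 1.7), the recovered token changes and every subsequent step of the autoregressive decompression is corrupted.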


2. Proposal – ReproSpec

Strong reproducibility needs to be co-designed with LLM serving engines, not only at the kernel level. On the kernel side, we can expose an abstraction called ReproSpec, which records all necessary information for the kernel:

  • CTA tile size
  • Description of the aggregation order
    • Sequential / hierarchical / tensor cores / all-reduce / etc.
    • Number of elements accumulated per step.
    • Input and output precision.
    • ...
  • Hardware information
  • Hash of kernel template
  • Other relevant reproducibility metadata
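A ReproSpec of this shape could be a small immutable record with a stable digest usable as a batching and dispatch key. The sketch below is hypothetical (field names and the `key()` helper are illustrative, not an existing FlashInfer API):

```python
# Hypothetical sketch of a ReproSpec record; all names are illustrative.
from dataclasses import dataclass, asdict
import hashlib
import json

@dataclass(frozen=True)
class ReproSpec:
    cta_tile_kv: int      # CTA tile size on the KV dimension
    aggregation: str      # "sequential" | "hierarchical" | ...
    accum_elems: int      # number of elements accumulated per step
    in_dtype: str         # input precision, e.g. "float16"
    out_dtype: str        # output precision
    arch: str             # hardware information, e.g. "sm_90"
    kernel_hash: str      # hash of the kernel template source

    def key(self) -> str:
        # Stable digest over all fields, usable as a dispatch/batching key.
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()

spec = ReproSpec(128, "sequential", 64, "float16", "float16", "sm_90", "abc123")
```

Freezing the dataclass and hashing a canonical JSON encoding makes two specs compare equal exactly when every reproducibility-relevant field matches.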

When strong reproducibility is required, the serving engine batches only requests that share the same ReproSpec, and FlashInfer dispatches to the kernel implementation corresponding to that ReproSpec.

When no strong-reproducibility constraint is imposed, the LLM serving engine can freely batch requests with different configurations and select kernels based on performance heuristics.
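On the engine side, the batching rule in strong-reproducibility mode reduces to grouping pending requests by their ReproSpec key. A minimal sketch, assuming requests carry a precomputed spec key:

```python
# Sketch: batch only requests that share a ReproSpec key (hypothetical types).
from collections import defaultdict

def batch_by_spec(requests):
    """requests: iterable of (request_id, spec_key) pairs.
    Returns one batch (list of request ids) per distinct spec key."""
    batches = defaultdict(list)
    for req_id, key in requests:
        batches[key].append(req_id)
    return dict(batches)

# Requests 0 and 2 share a spec and may be batched; request 1 may not join them.
batch_by_spec([(0, "spec-a"), (1, "spec-b"), (2, "spec-a")])
# → {"spec-a": [0, 2], "spec-b": [1]}
```

In high-performance mode the engine would simply ignore the key and batch by its usual heuristics.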


3. Rationale

  1. Eliminating Numeric Variations:
    Small changes in the internal accumulation order or the choice of split-k can lead to different floating-point rounding paths, causing slight deviations in output. By explicitly specifying and enforcing a ReproSpec, we ensure deterministic dataflows.

  2. Ensuring Compression Fidelity:
    The LLM-as-compressor use case is severely impacted by even minuscule differences in logits. A single wrong token rank can derail the entire decompression pipeline. ReproSpec safeguards this workflow by guaranteeing identical numerical outcomes across prefill and decode stages.

  3. Scalability with Confidence:
    As context lengths and model sizes grow, so do floating-point accumulation errors. ReproSpec provides a standard contract for predictable, reproducible results in large-scale LLM deployments.
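The rounding-path effect described in point 1 is easy to reproduce in isolation: summing the same fp32 values in one long sequential chain versus k interleaved partial sums (the shape of a split-k reduction) generally yields results that differ in the last bits. This is a standalone demonstration of the floating-point behavior, not FlashInfer code:

```python
# Demonstration that fp32 accumulation order changes the result,
# analogous to split-k changing the attention aggregation order.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000).astype(np.float32)

seq = np.float32(0.0)
for v in x:                 # one long sequential chain
    seq = seq + v

k = 4                       # "split-k" style: k partial sums, then combine
parts = [np.float32(0.0)] * k
for i, v in enumerate(x):
    parts[i % k] = parts[i % k] + v
split = np.float32(0.0)
for p in parts:
    split = split + p

# The two totals agree only approximately; the final bits typically differ.
print(seq, split, seq == split)
```

Neither result is "wrong"; they are different valid roundings. Strong reproducibility therefore requires pinning the order, which is exactly what the aggregation-order fields of ReproSpec record.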


4. Implementation Considerations

  1. Metadata Tracking:
    The serving engine and kernel must coordinate to store and retrieve the ReproSpec metadata. This may involve hashing relevant kernel templates, CTA tiling parameters, etc.

  2. Kernel Dispatch:
    The runtime should maintain a mapping from each ReproSpec to a specific kernel implementation that matches the configuration recorded in the spec.

  3. Performance vs. Reproducibility:
    The serving engine should choose its batching policy depending on whether strong reproducibility is required.

    • High-Performance Mode: No strong reproducibility constraints, allowing flexible batching and kernel choices.
    • Strong Reproducibility Mode: All requests sharing the same ReproSpec are batched together, ensuring identical numerical results.
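Tying points 1 and 2 together, the runtime-side dispatch can be a plain table keyed by the ReproSpec digest, with a hard failure when no matching kernel is registered (silently falling back to a different kernel would defeat reproducibility). Kernel stubs and key strings below are hypothetical:

```python
# Sketch of a ReproSpec-keyed dispatch table (hypothetical kernel stubs).
def attention_sequential(batch):
    # Stand-in for a kernel with sequential aggregation order.
    return f"sequential({batch})"

def attention_splitk(batch):
    # Stand-in for a split-k kernel.
    return f"splitk({batch})"

DISPATCH = {
    "spec-seq-f16-sm90": attention_sequential,
    "spec-splitk-f16-sm90": attention_splitk,
}

def run(spec_key: str, batch):
    # Fail loudly rather than substitute a numerically different kernel.
    try:
        kernel = DISPATCH[spec_key]
    except KeyError:
        raise ValueError(f"no kernel registered for ReproSpec {spec_key}")
    return kernel(batch)

run("spec-seq-f16-sm90", "B0")  # → "sequential(B0)"
```

In high-performance mode the same table could be bypassed entirely in favor of heuristic kernel selection.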

cc @Ying1123 @xiezhq-hermann

@yzh119 yzh119 changed the title [RFC]: Introducing ReproSpec for Strong Reproducibility in FlashInfer [RFC]: Introducing ReproSpec for Strong Reproducibility in LLM Inference Jan 11, 2025