Feature Request: Add Min-P sampling layer #1154
Comments
Forgot to mention: this sampling method should be applied before temperature.
@ncomly-nvidia seconded on adding min-p - makes a noticeable impact on production, doesn't seem too bad to implement compared to some others.
@byshiue any chance of this being added soon? 👀 Most other engines have it now - it's in vLLM, and HF transformers is also adding it.
^^ has a huge impact on production
If Nvidia doesn't want to do it (why? much superior inference results...), maybe we can add it ourselves? It looks like the sampling layers are part of the code that is open source:
https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/kernels/samplingTopPKernels.h
https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/layers/topPSamplingLayer.h
https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/layers/samplingLayer.cpp#L52
@aikitoria I found that it was far easier and more performant to implement in decodingCommon.cu, since the same math used for logprobs can be used to calculate the relative threshold for min-p sampling. I'll need to review that things are done in the correct order: I'm still grappling with the codebase, but I assumed it should be doing min-p before sampling.
Also mentioned in issue #1683.
I hope this feature gets added soon!
You can implement this as a logit processor as far as I can tell:

```python
import torch

def _get_min_p_fn(min_p: float):
    def _fn(
        logits: torch.Tensor,
    ) -> torch.Tensor:
        probs = torch.softmax(logits, dim=-1)
        # threshold is relative to the probability of the most likely token
        top_prob = probs.max(dim=-1, keepdim=True).values
        scaled_min_p = min_p * top_prob
        # mask everything below the scaled threshold
        tokens_to_remove = probs < scaled_min_p
        logits = logits.masked_fill(tokens_to_remove, -float("inf"))
        return logits
    return _fn
```
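For anyone who wants to sanity-check the behavior, here is a quick, hypothetical test of that processor on a toy 1-D logits tensor (values made up purely for illustration):

```python
import torch

min_p_fn = _get_min_p_fn(0.1)
logits = torch.tensor([2.0, 1.0, 0.0, -4.0])
filtered = min_p_fn(logits)
print(filtered)                         # only the last token falls below 0.1 * top prob and is set to -inf
print(torch.softmax(filtered, dim=-1))  # remaining mass is renormalized over the surviving tokens
```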
Is that not much slower than it would be if it were properly implemented in the CUDA sampling layers? Just saw the PR above also. That's an interesting way. Highly doubt nvidia would accept it given it seems more like a hack, but it gives us something to experiment with...
@aikitoria Yeah, since it's computed per request and not in a batch (see also #1681). But if you are already using other logit processors, it might not have that big of an effect.
FWIW the new executor API does not allow parametrizing logit processors per-request anymore -- they are fixed at startup -- so one can't implement MinP that way. You have to go lower-level to GptManager in C++, so bumping this thread @ncomly-nvidia @AdamzNV |
cc @AdamzNV @ncomly-nvidia @laikhtewari for vis. |
This would still be great to have by default, so we don't have to maintain a custom build just to change what the sampling layer does :) |
I'm tired of waiting, so I'm just doing it myself the way I thought it should be done... but it is EXTREMELY FRUSTRATING that this library has closed-source parts for no reason. For example, I can't modify executor to actually pass the min_p parameter through from the frontend, or allow top_p and min_p to exist at the same time by grouping them into separate batches for sampling. Why???

FWIW I don't think the PR linked above is the correct way to do this. When min_p is in use, the top_k and top_p layers should not run, and temperature needs to be applied after the min_p filter. We can also fuse the min_p filter, late temperature, and sampling into a single kernel for best performance if logprobs is not required (I don't use it).
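As a rough illustration of that ordering - not the actual fused kernel, just a minimal PyTorch sketch under the assumptions above (min-p is computed on the un-tempered distribution, temperature only scales what survives, logprobs are not needed):

```python
import torch

def sample_with_min_p(logits: torch.Tensor, min_p: float, temperature: float) -> torch.Tensor:
    # 1) min-p filter on the raw (temperature-free) distribution
    probs = torch.softmax(logits, dim=-1)
    threshold = min_p * probs.max(dim=-1, keepdim=True).values
    logits = logits.masked_fill(probs < threshold, -float("inf"))
    # 2) temperature is applied only after the filter
    logits = logits / temperature
    # 3) sample from the surviving tokens (top-k/top-p layers are skipped entirely)
    return torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
```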
I've created a WIP branch with my experiment here: main...aikitoria:TensorRT-LLM:min-p |
It would be very nice if the library supported using Min-P sampling as an alternative to Top-P/Top-K. This became popular for local LLMs in the past few months because it provides significantly more useful results, or at least feels like it does. More info here: https://www.reddit.com/r/LocalLLaMA/comments/17vonjo/your_settings_are_probably_hurting_your_model_why/
Most other libraries already support it, examples:
turboderp-org/exllamav2@0d436d7
ggerganov/llama.cpp#3841
This only requires a single parameter: keep all tokens whose probability is greater than the probability of the most likely token scaled down by that number (for example, with min_p = 0.1 and a top-token probability of 0.6, every token with probability above 0.06 survives).
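Restating that rule as a formula (my paraphrase, where $p_i$ is the softmax probability of token $i$):

$$\text{keep token } i \iff p_i \ge \mathrm{min\_p} \cdot \max_j p_j$$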