Fix logprobs when multiple tokens are returned at once. #141

zewt · 2024-06-24T23:52:11Z

This fixes a few issues with logprobs:

Return the tokens that were actually selected in logprobs.tokens and logprobs.token_logprobs (currently they use the first entry in top_logprobs).
Handle the text including multiple tokens, including each token in the logprobs arrays.
Fix top_logprobs flattening multiple tokens. Currently if there are two tokens in the entry, top_logprobs includes the top N tokens from each in a single flattened array. Return each token's logprobs in its own element instead.
Include an index for each token in text_offset.

Here's an example of the current output. To reproduce this more easily, I set "Helloxxx" as a stop string, which causes "Hello" + " !" to be returned together by exllamav2:

"choices": [{
    "text": "Hello!",
    "logprobs": {
        "text_offset": [6],
        "token_logprobs": [-0.31962889432907104],
        "tokens": ["Hi"],
        "top_logprobs": [{
            "Hi": -0.31962889432907104,
            "Hello": -1.3464146852493286,
            "HI": -5.0294508934021,
            "Hey": -5.989271640777588,
            "_HI": -7.116503715515137,
            "!": -1.1828906536102295,
            "there": -7.935122966766357,
            " There": -7.935122966766357,
            " THERE": -8.649408340454102,
        }]
    }
}]

Note that "tokens" is "Hi", even though the actual text is "Hello!", and the logprobs for the two are lumped together. With this update:

"choices": [{
    "text": "Hello!"
    "logprobs": {
        "text_offset": [0, 1],
        "token_logprobs": [-1.3473599425440113, -1.253579154961466],
        "tokens": ["Hello", "!"],
        "top_logprobs": [{
            "Hi": -0.3383535146713257,
            "Hello": -1.2981750965118408,
            "HI": -4.947728633880615,
            "Hey": -6.052639484405518,
        }, {
            " there": -0.37861719727516174,
            "!": -1.159867286682129,
            "there": -7.878617286682129,
            " There": -7.92326021194458,
        }]
    }
}]

On the chat completion side, with a similar output where "Hello" + "!" are returned together:

"choices": [{
    "message": {
        "role": "assistant",
        "content": "Hello! It's nice to meet you."
    },
    "logprobs": {
        "content": [{
            "token": "Hello",
            "logprob": -0.0008049269672483206,
            "top_logprobs": []
        }, {
            "token": " It",
            "logprob": -0.16903051733970642,
            "top_logprobs": [
                { "token": "Hello", "logprob": -0.0008049269672483206, "top_logprobs": null },
                { "token": "Hi", "logprob": -8.594554901123047, "top_logprobs": null }
            ]
        }],
    }
}]

The tokens are mismatched: the "!" token is missing and the top_logprobs are off by one. This now returns:

"choices": [{
    "message": {
        "role": "assistant",
        "content": "Hello! It's nice to meet you."
    },
    "logprobs": {
        "content": [{
            "token": "Hello",
            "logprob": -0.0008823590930566984,
            "top_logprobs": [
                { "token": "Hello", "logprob": -0.0008823591051623225 },
                { "token": "Hi", "logprob": -8.235257148742676 },
            ]
        }, {
            "token": "!",
            "logprob": -0.0008548574740537445,
            "top_logprobs": [
                { "token": "!", "logprob": -0.0008548574987798929 },
                { "token": " there", "logprob": -7.094604969024658 }
            ]
        }]
    }
}]

A couple things that still need to be figured out:

I'm not sure if text_offset supposed to be the offset into the text string (this is close to what it was doing before, so I went with that for now), or the offset into the full context. I can't find OAI docs on this, but from some API snippets I've seen it might be the latter. (It's simple to derive from the other data, so maybe nobody's actually using this field right now.)
Results are odd when token healing is enabled, since the regenerated initial token is included in the list. For example, if the context was "https://", and token healing backs up by three characters and generates "://www", it currently returns that whole underlying token (and a text_offset of -3, since the token starts three characters before the start of the output). But from the client's perspective all that the model actually generated was "www". The token healing overlap should probably be trimmed off from the output, so concatenating the "token" in each entry always gives the same result as "text". I'll return to this after discussion.

This also brings the two chat/completions code paths back into alignment.

zewt · 2024-07-18T07:54:11Z

Token healing is tricky. I think the behavior I described above is correct (if the tokens in logprobs are "http://" and "https://", but "http" is part of token healing overlap and not actually output, the API should strip it out from logprobs too and return "://" and "s://"). But I think doing this needs more information from exllamav2. I tried implementing this by looking at the lengths of the tokens to make a guess, but that's not correct in general (for example, it's wrong if skip_special_tokens is false).

text_offset has a similar problem: it advances by the length of the token, but that's wrong in several cases (skip_special_tokens false, token healing, perhaps others). Maybe the same information for token healing would help here too, like exllamav2 calculating the offset into text (it has the missing info to do this correctly). I'm also still not sure whether text_offset is meant to be from the start of the response or context since I can't find OAI docs for it.

I think these are separate issues and should be explored separately from this patch.

zewt · 2024-09-25T20:26:29Z

Here's a simple repro:

curl 'http://10.0.0.7:5000/v1/completions' \
  -H 'Authorization: Bearer xxx' \
  -H 'Content-Type: application/json' \
  --data-raw '{"logprobs":2,"max_context_length":2048,"max_tokens":4,"token_healing":true,"prompt":"<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful AI.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nSay hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n", "stop": "Helloxxx"}'

This is with turboderp/Llama-3.1-8B-Instruct-exl2_5.0bpw, but I think any Llama version will repro it. The "Helloxxx" stop sequence just makes it easier to repro since it triggers buffering in exllamav2 when the model says "Hello". response.txt

This shows the issues in original example:

top_logprobs shows ["Hello", " Hello", "!", " there"] in a single token. Two separate token results have been merged: "Hello" and "!".
The "tokens" array is missing "!".
It also shows the token healing problem: the first entry in tokens and top_logprobs show token healing regenerating "\n\n", which I think should be invisible to the API (this part isn't fixed by this PR).

One thing this doesn't show is that "tokens" always uses the first entry from top_logprobs, instead of the token that was actually chosen. To see that, set temperature to 2 and change the prompt to just "Hello!". The tokens array will be completely different from the actual results. Temperature will cause different tokens to be chosen, but the tokens array doesn't use them.

zewt force-pushed the logprobs branch 3 times, most recently from e55d3c7 to eefb572 Compare June 25, 2024 00:04

Fix logprobs when multiple tokens are returned at once.

7ebb0b2

zewt force-pushed the logprobs branch from eefb572 to 7ebb0b2 Compare June 25, 2024 00:10

Logprobs fixes for streaming chat/completions.

958e222

This also brings the two chat/completions code paths back into alignment.

zewt force-pushed the logprobs branch from 3acab3b to 958e222 Compare June 25, 2024 01:47

zewt marked this pull request as ready for review July 18, 2024 07:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix logprobs when multiple tokens are returned at once. #141

Fix logprobs when multiple tokens are returned at once. #141

zewt commented Jun 24, 2024

zewt commented Jul 18, 2024

zewt commented Sep 25, 2024

Fix logprobs when multiple tokens are returned at once. #141

Are you sure you want to change the base?

Fix logprobs when multiple tokens are returned at once. #141

Conversation

zewt commented Jun 24, 2024

zewt commented Jul 18, 2024

zewt commented Sep 25, 2024