Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] generator.iterate() returns corrupted result objects in some cases #689

Open
3 tasks done
p-e-w opened this issue Nov 29, 2024 · 8 comments
Open
3 tasks done
Labels
bug Something isn't working

Comments

@p-e-w
Copy link
Contributor

p-e-w commented Nov 29, 2024

OS

Linux

GPU Library

CUDA 12.x

Python version

3.12

Pytorch version

2.5.0

Model

No response

Describe the bug

This is an actual result object I received from generator.iterate():

{
  "job": ExLlamaV2DynamicJob 941,
  "stage": "streaming",
  "eos": False,
  "serial": 941,
  "identifier": 941,
}

As you can see, eos is False, but the fields text, token_ids, etc. are all missing. Thus this "result" object contains no information at all. I have not been able to determine yet whether there are tokens missing from the output because of this, or this is a non-event that gets emitted for some reason.

This is a VERY rare occurrence. I have to start hundreds of jobs and generate between 10,000 and 50,000 tokens in order for this to happen. Perhaps a race condition somewhere in the threading logic?

Reproduction steps

Any use of the dynamic generator appears to trigger this eventually, provided enough jobs are started and enough tokens are generated. Just place a check like

if result["stage"] == "streaming":
    assert result["eos"] or ("text" in result)

in the loop and eventually you will get an error.

Expected behavior

If EOS has not been reached, results should contain token data.

Logs

No response

Additional context

No response

Acknowledgements

  • I have looked for similar issues before submitting this one.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will ask my questions politely.
@p-e-w p-e-w added the bug Something isn't working label Nov 29, 2024
@turboderp
Copy link
Member

Not every iteration will emit text data. Every iteration processes one token per active job, but one token doesn't always result in text output. This can happen if there's a partial match to a banned string or stop string, which will hold any text until the condition is resolved. There are also plenty of characters that are encoded as multiple tokens, in which case you may only get text on every 2nd or 4th iteration or whatever.

I would simply use result.get("text", "") if you want to interpret zero tokens as an empty string.

@p-e-w
Copy link
Contributor Author

p-e-w commented Dec 2, 2024

But then why is the result object emitted at all? AFAICT, such objects contain no actionable information. It's not even possible to tell which of the conditions you describe has occurred. What is the user supposed to do with such a "result"?

@turboderp
Copy link
Member

It's a design choice.

The generator performs one iteration per call to iterate(), which is one forward pass through the model with a sampling step at the end. Some iterations produce one or more characters of text, and some result in less than one character.

It's technically possible to return less than one character (as a byte array containing incomplete UTF-8 codes), but I don't think the Tokenizers library allows for it in a clean way and it would complicate client code a lot having to deal with that.

The other option is for iterate() to run an arbitrary number of iterations until there is decoded text to emit. This would complicate the control flow a bit, though, and make timing less predictable from the client's perspective. It's also not clear how to deal with the case where you have two or more concurrent jobs, and one of them samples the first two bytes of a three-byte emoji. Should all the other jobs stall while the one with the incomplete character runs a single pass at bsz 1? Or should they all sample one more token in that case? What if two jobs get out of phase when generating strings of Chinese characters that are two tokens each? This could lead to long stretches of no output at all.

Alternatively, an iteration could simply not return a result for a job that doesn't produce output, as opposed to returning an empty result. To me these seem roughly equivalent, but I went with the latter because it at least is a way for the client to confirm that a job is running and making progress, even when that progress hasn't yet produced a decodable string.

@p-e-w
Copy link
Contributor Author

p-e-w commented Dec 4, 2024

Alternatively, an iteration could simply not return a result for a job that doesn't produce output, as opposed to returning an empty result.

That's indeed the behavior I would have expected. I agree that the other two options are bad, but the current approach is a bit as if an HTTP server framework were to emit empty request objects to the handler while request chunks arrive over TCP, and then emit the populated object once the complete request has been parsed.

In general, I believe that objects should make as many guarantees about their structure as possible to simplify downstream code. The reasonable assumption

result["eos"] XOR ("text" in result)

does not hold with the current implementation, and that forces the user to either complicate their state machine or to weaken lookups like you described. IMO it would be preferable to just not receive such incomplete results in the first place. Alternatively, they could contain a field that specifies why there is no token content, but I see very little value in that from a user perspective.

@DocShotgun
Copy link
Contributor

It's in theory possible for the generator to not produce any decodable text for a significant amount of time (say, in the scenario where it keeps producing banned strings and then having to rewind). Would it not be better to at least yield some response so that the client knows that it's still making progress?

@turboderp
Copy link
Member

result["eos"] XOR ("text" in result)

I don't really agree with this. It works when the stop condition is specifically an EOS token, but for stop strings and for reaching the token limit, you'd have to run dummy iterations for jobs that have already reached an end state in their previous iteration.

As it is I don't think it's really hard to deal with. EOS also isn't a condition you necessarily have to react to, since the job is automatically removed from the queue as it ends. So all the client needs to do is:

while generator.num_remaining_jobs():
    results = generator.iterate()
    for result in results:
        identifier = result["identifier"]
        outputs[identifier] += result.get("text", "")
        # Optionally, if you want to react precisely when each job finishes
        if result["eos"]:
            print(outputs[identifier])
# generator is now finished with all jobs enqueued prior to the loop

There's also the async generator that allows you to create jobs as individual async generators. This still runs iterate() under the hood and would be complicated by adding dummy iterations or by changing the convention that one iteration = one token.

my_job = ExLlamaV2DynamicJobAsync(...)
for result in my_job:
    print(result.get("text", ""), end = "")
# generator is now finished with my_job but may have other active or pending jobs from other coroutines

@p-e-w
Copy link
Contributor Author

p-e-w commented Dec 6, 2024

I don't really agree with this. It works when the stop condition is specifically an EOS token, but for stop strings and for reaching the token limit, you'd have to run dummy iterations for jobs that have already reached an end state in their previous iteration.

I completely agree that there shouldn't be "dummy iterations" in the background. But that doesn't mean that the generator should emit result objects for jobs for which there aren't any actual results.

This isn't how most other libraries operate either. An HTTP server doesn't push incomplete requests to the handler just to inform it that it is still listening on the socket. If something goes wrong during generation, I expect the library to raise an exception, rather than telling me intermittently "yes, I'm still making progress, although I don't have anything to show yet".

The way the generator operates is fine; I just think iterate() should filter out result objects that don't contain results.

@turboderp
Copy link
Member

But then it comes down to whether you get an empty list of results or a list of empty results. Either way, I don't see a big difference in terms of how the client has to react, but the latter conveys a bit more information (specifically, which jobs are currently active, though more metadata might be added in the future) that could be of use to someone, and which you could filter out easily if it's not useful:

results = [r for r in generator.iterate() if "text" in r]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants