BatchedExecutor Double Buffering #748
Martin,
So, what I mean here is that for an end user all of this will be very overwhelming and confusing. What they would want is simple but real-world scenarios that they can immediately understand and use in their work.
To some extent that's expected, the That said, any feedback that makes it nicer to use is valuable, keep it coming 😁
I used clone when first developing this and switched to calling it fork instead. To me "Clone" has a connotation of copying the conversation, which is definitely not what is happening! A fork of a conversation is basically free in terms of memory usage. It's analogous to the Unix fork system call, which "forks" a process without copying the memory involved.
Yeah agreed, it's all a bit ugly. Due to the low level nature of this executor the examples are a bit hard to follow. In this case the "Node" class is representing the tree of conversations, each "Split" of a node corresponds to a "Fork" of the conversation. An easy improvement would be to at least rename "Split" to "Fork" so the same terminology is used everywhere!
You should see an output like this: Where you can follow any line (such as the one I've highlighted) to get a reply. In this example: "Not many people know that in addition to his numerous films and televisions series, the actor George Peppard, best known as Shepard Fairey on "Fawlty Towers," had quite an interesting voice career. He recorded narrations for dozens of documentaries, films, and cartoon shows. Let's take a closer look at some of his best roles".
I think this could be switched to use the
This goes back to the low level nature of the sampler. If you want to build a custom executor, or define a custom sampling mechanism dealing directly with raw logits these are the things you would be using. Long term, there should be new executors which wrap this stuff up in more limited but much more "user-friendly" classes. We've already got that to some extent, where the current high level executors can accept an
This is a terrible name, it should really be something like " Originally these demos were written by porting across bits of llama.cpp demos (with all the terrible C++ naming conventions) and calling the low level native functions.
Say you have a conversation that generates tokens This is not equivalent to save/load for two reasons. The first is simply usability: if you can't rewind, you'd need to know ahead of time when to save a "checkpoint" to rewind to. Saving is fairly expensive (it's a large copy), so you want to minimise the need to defensively save just in case you ever want to come back to this point. More importantly, it's much less memory efficient to save and load! If you save a state and then load it into a new conversation you have copied that entire state, consuming a lot of extra memory. Whereas if you rewind you are using the same memory cells for
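To make the rewind versus save/load trade-off concrete, here is a minimal toy sketch in Python. It is not the LLamaSharp API: `ToyConversation`, its methods, and the `cells_copied` counter are all hypothetical names invented for illustration, and a plain list stands in for the KV cells.

```python
class ToyConversation:
    """Hypothetical stand-in for a conversation; a list models its KV cells."""

    def __init__(self):
        self.cells = []
        self.cells_copied = 0  # counts cells duplicated by save/load

    def generate(self, tokens):
        self.cells.extend(tokens)

    def rewind(self, n):
        # Drop the last n cells in place: nothing is copied, and no
        # checkpoint had to be prepared in advance.
        if not 0 <= n <= len(self.cells):
            raise ValueError("cannot rewind past the start")
        del self.cells[len(self.cells) - n:]

    def save(self):
        # A checkpoint is a full copy of the state.
        snapshot = list(self.cells)
        self.cells_copied += len(snapshot)
        return snapshot

    def load(self, snapshot):
        # Loading copies the whole state again.
        self.cells = list(snapshot)
        self.cells_copied += len(snapshot)


# Rewinding: decide after the fact to go back two tokens.
rewound = ToyConversation()
rewound.generate(["a", "b", "c", "d", "e"])
rewound.rewind(2)

# Save/load: the checkpoint must be taken before it is known to be needed.
reloaded = ToyConversation()
reloaded.generate(["a", "b", "c"])
checkpoint = reloaded.save()
reloaded.generate(["d", "e"])
reloaded.load(checkpoint)
```

Both conversations end in the same state, but in this toy model only the save/load path copies any cells, and it only works because the checkpoint was taken ahead of time.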
OK, if this is for library developers, then I would suggest creating a separate solution for these kinds of examples. The 'examples' should help the users of the library, not confuse them 😵. Or, when you split the code into low level and high level parts, these examples could be moved to the low level part. I think that several names and some of the logic should be changed even for library developers. I will not repeat the above, but just mention that you did not convince me with your explanation of 'fork' 😅. If I fork a GitHub repository, it is copied completely, and I do not use Unix, so the analogy does not help here. Maybe you should find a better name. It would probably be easier to understand if you 'Clone' the context, which is probably what you mean, if I understand you correctly.
@zsogitbe For the current implementations in LLamaSharp, there's little semantic difference between using Beam search in LLM can be described as following steps.
After the Taking your analogy of repo clone and fork, please regard There's also another reason: after forking the sequence, the kv-cache of it will not be copied at once. The copying happens only when one of the sequences is used to generate the next token.
The copying doesn't happen then either! If you do this:
Then you will have 9 occupied KV slots, like this: This is why the save and load method for cloning a conversation, instead of forking it, is bad. As soon as you save and load, the slots are no longer shared and you are taking up a lot more memory and compute.
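The slot-sharing argument can be sketched with a toy model in Python. This is only an illustration under stated assumptions: `ToyKvCache` and its methods are hypothetical names, each cell is modelled as the set of sequence ids that share it, and the token counts are made up (they are not the numbers from the example above).

```python
class ToyKvCache:
    """Hypothetical KV cache: each cell records which sequences share it."""

    def __init__(self):
        self.cells = []  # one set of sequence ids per occupied cell

    def append(self, seq_id, n_tokens):
        # New tokens always occupy fresh cells owned by a single sequence.
        for _ in range(n_tokens):
            self.cells.append({seq_id})

    def fork(self, src, dst):
        # Forking tags the existing cells with the new sequence id.
        # No cell is allocated and nothing is copied, now or later.
        for owners in self.cells:
            if src in owners:
                owners.add(dst)

    def copy(self, src, dst):
        # Save/load-style cloning duplicates every cell of the source.
        n = sum(1 for owners in self.cells if src in owners)
        self.append(dst, n)

    def occupied(self):
        return len(self.cells)


# Fork: a 4-token prompt is shared, then each branch adds one token.
forked = ToyKvCache()
forked.append(0, 4)
forked.fork(0, 1)
forked.append(0, 1)
forked.append(1, 1)
forked_cells = forked.occupied()

# Copy: the same scenario, but the prompt is duplicated for the new branch.
copied = ToyKvCache()
copied.append(0, 4)
copied.copy(0, 1)
copied.append(0, 1)
copied.append(1, 1)
copied_cells = copied.occupied()
```

In this model the forked branches occupy 6 cells in total while the copied ones occupy 10, and the gap grows with the length of the shared prefix.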
Added "double buffering" to the batched executor. This allows conversations to be prompted (adding tokens to one batch) while inference is still running on the other batch.
There's some special consideration given to errors. If an error occurs during inference, the previous batch is swapped back to be the "next" batch. This allows infer to be called again (after fixing whatever caused the error).
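As a rough illustration of the swap and the swap-back on error, here is a minimal single-threaded sketch in Python. It is not the code in this PR: `DoubleBufferedExecutor`, `prompt`, and the `run_inference` callback are hypothetical names, and the sketch does not model tokens that arrive in the other batch while a failing inference is still running.

```python
class DoubleBufferedExecutor:
    """Hypothetical sketch of two batches with swap-back on error."""

    def __init__(self, run_inference):
        self._run = run_inference   # callable taking a list of tokens
        self._batches = [[], []]
        self._next = 0              # index of the batch accepting tokens

    def prompt(self, tokens):
        # New tokens always go into the "next" batch.
        self._batches[self._next].extend(tokens)

    def infer(self):
        current = self._next
        # Swap first, so prompt() can fill the other batch during inference.
        self._next = 1 - current
        try:
            result = self._run(self._batches[current])
        except Exception:
            # Swap back: the failed batch is "next" again, so infer()
            # can simply be called again once the cause is fixed.
            self._next = current
            raise
        self._batches[current].clear()
        return result


state = {"fail": True}

def flaky_inference(tokens):
    # Stand-in for the real inference call; fails until state is fixed.
    if state["fail"]:
        raise RuntimeError("inference failed")
    return list(tokens)

executor = DoubleBufferedExecutor(flaky_inference)
executor.prompt([1, 2, 3])
try:
    executor.infer()
    first_error = None
except RuntimeError as err:
    first_error = str(err)

state["fail"] = False          # "fix whatever caused the error"
retry_result = executor.infer()
executor.prompt([4])
second_result = executor.infer()
```

The retry sees the same three tokens because the failed batch was never cleared and was restored as the "next" batch.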