[Usability bug]: KV-cache LLMs are difficult to add to MLAgility #315
Comments
Addendum: I see this as further data that the approach of supplying dummy inputs is not scalable as models become increasingly complex.
@danielholanda curious to get your thoughts on this now that you're back!
Let me know if I'm understanding this correctly. The challenge you are describing is the amount of work needed to create these template models. If that is the case, our plans to take the shape of model inputs into account when calculating the hash should solve this issue.
@danielholanda interesting! Is that because we could pass a true application in, which would perform prefill followed by KV-cached generation, and MLAgility would detect the two invocations as separate models? Some follow-up questions/challenges, though:
Exactly. If the same model is executed with N different input shapes, then N models will be detected. It is also true that the only information we provide to the user to differentiate between those models is the hash. We might want to also display input shapes when the same model is executed more than once with different inputs.

I agree that this would be nice, as long as we can do it in a programmatic way.

Very interesting point. Another alternative here is to set a maximum number of "model variants" to execute per model. (Note: a model variant = the same model with different input shapes.)
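To make the shape-aware hashing idea concrete, here is a minimal sketch (hypothetical; `model_variant_hash` is an illustrative name, not MLAgility's actual implementation) of how folding input shapes into the hash would separate a prefill invocation from a generation invocation:

```python
import hashlib

import torch

def model_variant_hash(model_source: str, inputs: dict) -> str:
    # Serialize every input tensor's shape, e.g. ("input_ids", (1, 64)),
    # and fold it into the hash alongside the model's source
    shapes = str(sorted((name, tuple(t.shape)) for name, t in inputs.items()))
    return hashlib.sha256((model_source + shapes).encode()).hexdigest()[:8]

# Prefill (seq_len=64) and generation (seq_len=1) now hash to distinct variants
prefill = {"input_ids": torch.ones(1, 64, dtype=torch.long)}
decode = {"input_ids": torch.ones(1, 1, dtype=torch.long)}
assert model_variant_hash("llama_source", prefill) != model_variant_hash("llama_source", decode)
```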
Unlike most Transformer models, which have a simple `forward()` signature, KV-cache LLMs have a complex signature that is difficult to encode into MLAgility's model template. For example, LLaMA with no KV-cache:
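Roughly, the simple case looks like this (a sketch: the toy config stands in for real pretrained weights, and the dimensions are assumptions):

```python
import torch
from transformers import LlamaConfig, LlamaForCausalLM

# Tiny config so the sketch runs without downloading weights; a real run
# would use LlamaForCausalLM.from_pretrained(...) instead
config = LlamaConfig(
    vocab_size=1000, hidden_size=64, num_hidden_layers=2,
    num_attention_heads=4, intermediate_size=128,
)
model = LlamaForCausalLM(config)

batch, seq_len = 1, 64
inputs = {
    "input_ids": torch.ones(batch, seq_len, dtype=torch.long),
    "attention_mask": torch.ones(batch, seq_len, dtype=torch.long),
}
outputs = model(**inputs)  # a single forward pass; no cache involved
```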
LLaMA with KV-cache enabled:
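A sketch of this case, continuing from the snippet above (shapes assume the legacy tuple cache format, one `(key, value)` pair per layer):

```python
import torch

# Reuses model, config, and batch from the previous sketch
num_heads = config.num_attention_heads
head_dim = config.hidden_size // num_heads
past_len, new_len = 63, 1  # 63 cached tokens, 1 new token to decode

# One (key, value) pair per layer, each of shape
# (batch, num_heads, past_len, head_dim)
past_key_values = tuple(
    (
        torch.zeros(batch, num_heads, past_len, head_dim),
        torch.zeros(batch, num_heads, past_len, head_dim),
    )
    for _ in range(config.num_hidden_layers)
)

inputs = {
    "input_ids": torch.ones(batch, new_len, dtype=torch.long),
    # The attention mask must cover cached plus new tokens
    "attention_mask": torch.ones(batch, past_len + new_len, dtype=torch.long),
    # position_ids index the new tokens, offset by the cache length --
    # the non-obvious value mentioned below
    "position_ids": torch.arange(past_len, past_len + new_len).unsqueeze(0),
    "past_key_values": past_key_values,
}
outputs = model(**inputs, use_cache=True)
```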
It was painful to figure out the details of the latter code, especially the specific value that needed to be assigned to `position_ids` to make everything work.

The reason for this interface complexity in the first place is that huggingface `transformers` doesn't expect anyone to invoke a KV-cache transformer a single time like this. They expect an app that actually maintains the cache. And since the cache inputs come from the model outputs, app developers don't have to think about how to format those values (woo python).

However, MLAgility would also not work well with such an app, because the first invocation of the model would be a prefill invocation (no KV cache used) and the subsequent invocations would be generation (KV cache used). These two invocation modes generate completely different ONNX files and benchmark results, yet MLAgility doesn't offer a clear way to distinguish between the two (that I know of).
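For reference, the app pattern described above looks roughly like this (a sketch reusing the toy `model` from the earlier snippets; the prompt is a stand-in rather than real tokens):

```python
import torch

prompt_ids = torch.ones(1, 8, dtype=torch.long)  # stand-in prompt

# Prefill invocation: the full prompt, no cache passed in
out = model(input_ids=prompt_ids, use_cache=True)
past = out.past_key_values
next_token = out.logits[:, -1:].argmax(-1)

# Generation invocations: one token at a time, cache carried between calls
for _ in range(4):
    out = model(input_ids=next_token, past_key_values=past, use_cache=True)
    past = out.past_key_values
    next_token = out.logits[:, -1:].argmax(-1)
```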
Filing this issue to keep track of the problem and any potential solutions. cc @danielholanda @ramkrishna2910