Add functionality from TypeScript implementation #8

Open
wants to merge 5 commits into
base: main
from

Conversation

DePasqualeOrg
Collaborator

@DePasqualeOrg DePasqualeOrg commented Dec 11, 2024

I used Claude 3.5 Sonnet over several iterations to add functionality from the huggingface.js implementation. Honestly, I don't understand everything in the result; I was hoping this could be a quick fix for vision language models and function calling. At the very least, it probably needs to be cleaned up. Maybe you can take a look and see if it's useful at all.

@johnmai-dev
Owner

Thank you for your PR. It's great! Could you please provide some test cases?

I may not have time until next month. I've been a bit busy lately.

Hi @pcuenca! Do you have time to help review this PR?

@DePasqualeOrg
Collaborator Author

DePasqualeOrg commented Dec 12, 2024

I'll add some tests for images and function calling and try to polish this up a bit.

I should have also formatted the code before editing it, to make the changes more legible. After this gets merged, maybe we can add some auto-formatting.

@DePasqualeOrg DePasqualeOrg marked this pull request as draft December 12, 2024 08:35
@johnmai-dev johnmai-dev linked an issue Dec 12, 2024 that may be closed by this pull request
@DePasqualeOrg
Collaborator Author

I was able to add some tests for text-only chat and images with Llama-3.2-11B-Vision, but I'm having a lot of trouble with the tool call test in testLlama32ToolCalls. I'm hitting the limit of what I can do, and I'll need some help from people more knowledgeable than me. @pcuenca, any tips on how to proceed here?

@pcuenca
Collaborator

pcuenca commented Dec 12, 2024

Hi @DePasqualeOrg, thanks a lot for the effort! It's a long diff, so I can try to take a look in a couple of days. Do we need everything at once, including namespaces, built-in functions, and tool calling, or could this potentially be approached in a few phases?

@DePasqualeOrg
Collaborator Author

DePasqualeOrg commented Dec 12, 2024

Tool calling could definitely be postponed. It seems to be more complicated to test anyway. Since we now have support for vision language models in mlx-libraries, maybe that should be the focus here.

If it's too hard to separate out the changes, maybe we can focus on test coverage for image models in this PR, and add tests for function calling in a later PR. I'll try to add tests for other vision language models and mark this PR as ready for review when I've covered the most popular ones.

@DePasqualeOrg
Collaborator Author

It looks like Qwen2-VL is the only other major vision language model that includes image handling in the chat template, so now we have tests for image handling in Qwen2-VL and Llama 3.2, and I think this is ready for review.

Sorry in advance for the somewhat messy approach here. I did actually spend the better part of a day putting everything together. I'm sure it's not perfect, but I wanted to get the ball rolling on this.

One thought I had was that we can benefit from the work that has already been done on the TypeScript implementation by keeping the Swift implementation as close to it as possible. As LLMs get even better, they'll be able to port things more easily, and keeping the libraries roughly isomorphic would help with that.

@DePasqualeOrg DePasqualeOrg marked this pull request as ready for review December 12, 2024 21:46
Development

Successfully merging this pull request may close these issues.

Parse Llama tool calls?
3 participants