Add functionality from TypeScript implementation #8
base: main
Conversation
Thank you for your PR, it's great! Could you please provide some test cases? I may not have time until next month; I've been a bit busy lately. Hi @pcuenca! Do you have time to help review this PR?
I'll add some tests for images and function calling and try to polish this up a bit. I should have formatted the code before editing it, to make the changes more legible. After this gets merged, maybe we can add auto-formatting.
I was able to add some tests for text-only chat and images with Llama-3.2-11B-Vision, but I'm having a lot of trouble with the tool call test in …
Hi @DePasqualeOrg, thanks a lot for the effort! It's a long diff; I can try to take a look in a couple of days. Do we need everything at once, including namespaces, built-in functions, and tool calling, or could this potentially be approached in a few phases?
Tool calling could definitely be postponed for later; it seems to be more complicated to test anyway. Since we now have support for vision language models in mlx-libraries, maybe that should be the focus here. If it's too hard to separate out the changes, we can focus on test coverage for image models in this PR and add tests for function calling in a later PR. I'll try to add tests for other vision language models and mark this PR as ready for review once I've covered the most popular ones.
It looks like Qwen2-VL is the only other major vision language model that includes image handling in its chat template, so we now have tests for image handling in Qwen2-VL and Llama 3.2, and I think this is ready for review. Sorry in advance for the somewhat messy approach; I did spend the better part of a day putting everything together. I'm sure it's not perfect, but I wanted to get the ball rolling. One thought: we can benefit from the work already done on the TypeScript implementation by keeping the Swift implementation as close to it as possible. As LLMs get even better, they'll be able to port things more easily, and keeping the libraries roughly isomorphic would help with that.
I used Claude 3.5 Sonnet over several iterations to add functionality from the huggingface.js implementation. Honestly, I don't understand everything, and I was hoping this could be a quick fix for vision language models and function calling. It probably needs to be cleaned up, at the very least. Maybe you can take a look and see whether it's useful at all.