Running without OpenAPI / vLLM #59
Hi sashokbg, …
Hello @khai-meetkai, thank you very much for the detailed explanation. I will try it as soon as I can at home, then get back to you and close this ticket :).
Hello @khai-meetkai, I have just tested the model with the tutorial you provided and it works very well! Especially since we can run it on a local machine and play around as much as we want for free :)
Woo, I was able to get this working on Apple's Metal Performance Shaders and with Chatlab's function registry. I'm using …
@rgbkrk we are training a new functionary model with the ability to call multiple functions in parallel; it is similar to OpenAI's parallel function calling. We hope Chatlab will support this soon!
I'll make sure to support it soon for your coming launch. Same format as OpenAI's, I assume?
@rgbkrk, yes, we managed to use the same format for both streaming and non-streaming.
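For reference, a parallel call in the OpenAI format is a single assistant message carrying a list of `tool_calls`. A minimal sketch of that shape, with made-up function names, arguments, and IDs:

```python
# Sketch of an OpenAI-style assistant message containing parallel tool calls.
# The function names, arguments, and IDs below are illustrative only.
assistant_message = {
    "role": "assistant",
    "content": None,  # no text content when the model decides to call tools
    "tool_calls": [
        {
            "id": "call_0",
            "type": "function",
            "function": {
                "name": "get_current_weather",
                # arguments arrive as a JSON-encoded string, not a dict
                "arguments": '{"location": "Istanbul"}',
            },
        },
        {
            "id": "call_1",
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "arguments": '{"location": "Paris"}',
            },
        },
    ],
}
```

In the streaming case the same structure is emitted incrementally as `tool_calls` deltas, which is why matching both modes matters for clients like Chatlab.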
Tracking the work in rgbkrk/chatlab#118. Just to wrap up and showcase the incredible power here, I made a little video. I've posted the same to Twitter as well: https://twitter.com/KyleRayKelley/status/1730296106695979273
What exactly did you do to get it running with MPS on your ARM M1/M2 processor?
I ran it using …
Would be great to have some instructions here. Thanks! 🙂
@ChristianWeyer …
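For anyone else looking for MPS instructions: a minimal sketch of one way to run the model on Apple silicon via PyTorch's MPS backend and Hugging Face transformers. The checkpoint name and generation settings below are assumptions for illustration, not the exact setup used above:

```python
# Minimal sketch: running a Hugging Face causal LM on Apple's MPS backend.
# The model id below is an assumption -- substitute the functionary checkpoint you use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meetkai/functionary-7b-v1.1"  # hypothetical choice of checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# Fall back to CPU when MPS is unavailable (e.g. on Intel Macs or in CI).
device = "mps" if torch.backends.mps.is_available() else "cpu"
model = model.to(device)

inputs = tokenizer("Hello!", return_tensors="pt").to(device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```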
@rgbkrk @khai-meetkai, may I ask whether it is possible to use …
@sandangel I'm sure there's a way. Functionary requires additional steps for inference because of the function & tool calling, so you'd have to port some of what's in this repo over to mlx usage.
@rgbkrk thanks a lot for your comment. Could you point me to where I should start? I really appreciate it. 😀
Start by looking at how …
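For the mlx route, a minimal sketch assuming the `mlx-lm` package and its `load`/`generate` helpers; the checkpoint path is hypothetical, and functionary's prompt templating and function-call parsing from this repo would still need to be ported on top:

```python
# Minimal sketch: plain text generation with mlx-lm on Apple silicon.
# This only covers raw generation; functionary's prompt construction and
# function-call parsing would have to be layered on top of it.
from mlx_lm import load, generate

# Hypothetical path to an MLX-converted functionary checkpoint.
model, tokenizer = load("mlx-community/functionary-7b-v1.1-mlx")

prompt = "Hello!"
text = generate(model, tokenizer, prompt=prompt, max_tokens=64)
print(text)
```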
Hello, due to GPU constraints I am trying to run the model using a C++ implementation: https://github.com/ggerganov/llama.cpp
This involves converting the model, quantizing it to 4 bits, and running it with the llama.cpp runner.
What would a properly constructed prompt look like for a "classic" run like this?
I tried just passing the functions and messages in the prompt like so, but it did not work (inference was not accurate):
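A minimal sketch of the direction that seems needed here, using the `llama-cpp-python` bindings on a quantized GGUF file: functionary is trained on prompts where functions are rendered as TypeScript-style definitions, not raw JSON specs, so the prompt has to be built with the same template this repo's prompt-generation code uses. The file name, template wording, role markers, and stop string below are illustrative assumptions; the authoritative template lives in this repo and depends on the model version:

```python
# Minimal sketch: running a 4-bit GGUF quantization of functionary with the
# llama-cpp-python bindings instead of vLLM.
# The prompt below is a hand-rolled approximation (assumption): functionary
# renders functions as TypeScript-style definitions in the system prompt, but
# the exact role markers and stop tokens must be taken from this repo's
# prompt-generation code for your model version.
from llama_cpp import Llama

llm = Llama(model_path="./functionary-7b.q4_0.gguf", n_ctx=4096)  # hypothetical file name

prompt = """system:
// Supported function definitions that should be called when necessary.
namespace functions {
// Get the current weather for a location
type get_current_weather = (_: {
// The city and state, e.g. San Francisco, CA
location: string,
}) => any;
} // namespace functions
user:
What is the weather in Istanbul?
assistant:"""

result = llm(prompt, max_tokens=128, stop=["user:"])  # stop string is an assumption
print(result["choices"][0]["text"])
```

Passing plain JSON function specs in the prompt, as tried above, will not match the format the model was trained on, which would explain the inaccurate inference.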