Replies: 5 comments
-
For anyone wanting to do this, see an initial attempt in #77, and in particular this comment on ggerganov's preferred approach. Should be pretty straightforward, I think.
-
It is already ongoing; check PR #77.
-
Yes, see the comment in #77 (review), as @bakkot suggested. This is the way 🦙
-
Ah, @bakkot beat me to it while I was writing. @Ronsor, please close this; the project has Discussions now: https://github.com/ggerganov/llama.cpp/discussions
-
I propose refactoring `main.cpp` into a library (`llama.cpp`, compiled to `llama.so`/`llama.a`/whatever) and making `main.cpp` a simple driver program. A simple C API should be exposed to access the model, and then bindings can more easily be written for Python, Node.js, or whatever other language. This would partially solve #82 and #162.
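To make the idea concrete, here is a minimal sketch of the kind of C API such a library could expose. Every name in it (`llama_model`, `llama_ctx`, `llama_model_load`, etc.) is hypothetical, not the project's actual interface; the point is just the shape: an opaque model handle for the weights, an opaque context handle for per-conversation state, and plain C functions that bindings can wrap.

```c
/* llama_api.h -- hypothetical sketch of a minimal C API for a llama library.
 * None of these names come from the actual codebase. */
#ifndef LLAMA_API_H
#define LLAMA_API_H

#ifdef __cplusplus
extern "C" {
#endif

typedef struct llama_model llama_model; /* weights: loaded once, read-only   */
typedef struct llama_ctx   llama_ctx;   /* per-conversation state (KV cache) */

/* Load the weights once; the handle can back many contexts. */
llama_model *llama_model_load(const char *path, int n_ctx);
void         llama_model_free(llama_model *model);

/* Create an inference context bound to a loaded model. */
llama_ctx *llama_ctx_new(llama_model *model);
void       llama_ctx_free(llama_ctx *ctx);

/* Tokenize text into the caller-supplied buffer; returns the token count. */
int llama_tokenize(const llama_model *model, const char *text,
                   int *tokens, int max_tokens);

/* Feed tokens through the model and return the sampled next token id. */
int llama_eval(llama_ctx *ctx, const int *tokens, int n_tokens,
               float temperature);

/* Convert a token id back to its string (storage owned by the model). */
const char *llama_token_to_str(const llama_model *model, int token);

#ifdef __cplusplus
}
#endif

#endif /* LLAMA_API_H */
```

With the C boundary kept this plain, Python (ctypes/cffi) or Node.js (N-API) bindings become thin wrappers rather than reimplementations.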
Edit: on that note, is it possible to do inference from two or more prompts on different threads? If so, serving multiple people would be possible without multiple copies of model weights in RAM.
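On the multi-prompt question, the same split suggests an answer: if the weights live in a shared, read-only `llama_model` and all mutable state (KV cache, sampling state) lives in a per-thread `llama_ctx`, then several prompts can be served concurrently from one copy of the weights. The sketch below assumes the hypothetical API above and is only an illustration of that ownership split, not working code against the real project.

```c
/* Serving two prompts concurrently with one copy of the weights:
 * the model is shared read-only, each thread owns its own context. */
#include <pthread.h>
#include <stdio.h>
#include "llama_api.h"   /* hypothetical header from the sketch above */

static llama_model *g_model;  /* shared, read-only after load */

static void *serve_prompt(void *arg)
{
    const char *prompt = arg;
    llama_ctx *ctx = llama_ctx_new(g_model);   /* per-thread state */

    int tokens[512];
    int n = llama_tokenize(g_model, prompt, tokens, 512);

    /* Generate a handful of tokens; each context advances independently. */
    for (int i = 0; i < 16; i++) {
        int next = llama_eval(ctx, tokens, n, 0.8f);
        printf("%s", llama_token_to_str(g_model, next));
        tokens[0] = next;   /* feed the sampled token back in */
        n = 1;
    }

    llama_ctx_free(ctx);
    return NULL;
}

int main(void)
{
    g_model = llama_model_load("models/7B/ggml-model.bin", 512);

    pthread_t a, b;
    pthread_create(&a, NULL, serve_prompt, "First user prompt");
    pthread_create(&b, NULL, serve_prompt, "Second user prompt");
    pthread_join(a, NULL);
    pthread_join(b, NULL);

    llama_model_free(g_model);
    return 0;
}
```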