Description
Hey, thank you so much for the great model and this repo!
Would you be willing to add support for this chat format to llama-cpp-python, so that we can use function calling (and JSON mode) through its OpenAI-compatible server?
Right now, llama-cpp-python offers the only OpenAI-compatible server with constrained/grammar-based sampling for CPU that I am aware of. It has been very convenient to use with the functionary models, as it is plug-and-play with the openai client and very reliable thanks to the grammar sampling.
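For context, here is roughly what that plug-and-play usage looks like today: the standard openai client pointed at a locally running llama-cpp-python server. The base URL, model name, and tool schema below are just placeholders for my setup, not anything prescribed by the library:

```python
from openai import OpenAI

# The llama-cpp-python server listens on port 8000 by default;
# it ignores the API key, but the client requires one to be set.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="sk-no-key-required",
)

response = client.chat.completions.create(
    model="functionary",  # whichever model the server was started with
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # placeholder tool for illustration
                "description": "Get the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    tool_choice="auto",
)
print(response.choices[0].message.tool_calls)
```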
Besides functionary, the library already supports a format called chatml-function-calling, which might be close enough to the Hermes format that it could simply be adapted instead of writing something from scratch.
All that would need to be added to the library is a handler along those lines; a rough sketch follows.
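To make the request concrete, here is a sketch of what such a handler could look like, modeled on how the existing chatml-function-calling handler is registered in llama_cpp.llama_chat_format. The handler name, the exact keyword arguments, and the Hermes tag details are my assumptions, and the body is only an outline of the steps, not working code:

```python
from llama_cpp import llama_chat_format

# "hermes-function-calling" is a hypothetical name for this format;
# registration itself uses the same decorator as the existing handlers.
@llama_chat_format.register_chat_completion_handler("hermes-function-calling")
def hermes_function_calling_handler(llama, messages, tools=None, tool_choice=None, **kwargs):
    # 1. Render the Hermes prompt: a system message carrying the tool JSON
    #    schemas inside <tools>...</tools>, followed by the chat turns in
    #    ChatML-style <|im_start|>/<|im_end|> blocks.
    # 2. Build a GBNF grammar that constrains the model to emit either plain
    #    text or a valid <tool_call>{...}</tool_call> JSON object.
    # 3. Call llama.create_completion(...) with that grammar and parse any
    #    <tool_call> blocks back into OpenAI-style tool_calls.
    ...
```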
Thanks!