server: (refactor) implement generator-based API for task results #17174
Conversation
Trying to address https://github.com/ggml-org/llama.cpp/pull/16486/files#r2419474810 in the meantime.

Edit: resolved in 31b8b70
tools/server/server.cpp (outdated)

```cpp
// next responses are streamed
json first_result_json = first_result->to_json();
const auto chunked_content_provider = [first_result_json, gen, oaicompat](size_t, httplib::DataSink & sink) mutable -> bool {
```
note: in the future, when we separate the HTTP implementation from the current code base, this chunked_content_provider callback pattern will disappear.

the goal is to make each server endpoint handler itself become a generator, which generates a JSON response each time the next() function is called.
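A minimal sketch of that idea (hypothetical names, not code from this PR): each handler yields JSON chunks through a next() call.

```cpp
#include <nlohmann/json.hpp>
using json = nlohmann::json;

// hypothetical sketch: an endpoint handler as a generator that
// produces one JSON chunk per next() call
struct response_generator {
    virtual ~response_generator() = default;
    // fill `out` with the next chunk; return false once the stream is exhausted
    virtual bool next(json & out) = 0;
};
```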
on second thought, since this chunked_content_provider lambda function is already a generator itself, we can just keep it and only change the return type.

the ultimate goal is to expose an API that allows writing code like this:
```cpp
const auto handle_chat_completions = [&](const Request & req, Response & res) {
    auto body = json::parse(req.body);
    // ... do parsing stuff with body
    auto response = handle_completions_impl(...);
    if (response.stream) {
        // response is now a generator; call next() until it returns false
        res.set_stream(true);
        json chunk;
        while (response.next(chunk)) {
            res.write(chunk.dump());
        }
        res.end();
    } else {
        // non-stream, response is a simple object
        res.set_content(response.data);
    }
};
```
I rename "generator" to "reader" as the term "generator" is better to be used to describe the interface between In a follow-up PR, I'll separate all http-related code into its own API. The idea is that For now, this PR should be ready for review. No rush but CC @ggerganov for visibility. |
This PR adds a generator-based API for receiving task results. It aims to reduce the usage of callback functions, making the code look more "linear" and easier to follow.

This also allows returning the correct HTTP error code in the streaming case, ref: #16486 (comment)
Example:
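(The example itself is truncated in this capture. As a stand-in, here is an illustrative use of the hypothetical reader sketched above, streaming each task result through cpp-httplib's DataSink; the names are assumptions, not the PR's code.)

```cpp
// illustrative only; response_reader is the hypothetical type sketched above
#include "httplib.h"

void stream_results(response_reader & rd, httplib::DataSink & sink) {
    // pull results linearly instead of nesting callbacks
    while (auto res = rd.next()) {
        const std::string chunk = "data: " + res->data.dump() + "\n\n";
        sink.write(chunk.data(), chunk.size());
    }
    sink.done();
}
```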