
Conversation

ngxson (Collaborator) commented Nov 11, 2025

This PR adds a generator-based API for receiving task results. It aims to reduce the use of callback functions, making the code more "linear" and easier to follow.

This also allows returning the correct HTTP error code in the streaming case, ref: #16486 (comment)

Example:

server_response_generator gen(ctx_server);
{
    std::vector<server_task> tasks;
    // ... populate tasks ...
    gen.post_tasks(std::move(tasks));
}

// wait for the results
auto all_results = gen.wait_for_all(req.is_connection_closed);

// collect results
if (all_results.is_terminated) {
    return; // connection is closed
} else if (all_results.error) {
    res_error(res, all_results.error->to_json());
    return;
} else {
    for (auto & result : all_results.results) {
        GGML_ASSERT(dynamic_cast<server_task_result_embd*>(result.get()) != nullptr);
        responses.push_back(result->to_json());
    }
    }
}
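For the streaming case, the first result can be checked before any bytes are written out, which is what makes returning a correct HTTP error code possible. Below is a minimal sketch of that path; gen.next(...) and first_result->is_error() are hypothetical names for illustration, not necessarily the exact API in this PR:

// post tasks as above, then pull just the first result before streaming starts
auto first_result = gen.next(req.is_connection_closed);
if (first_result == nullptr) {
    return; // connection closed while waiting
}
if (first_result->is_error()) {
    // nothing has been written to the socket yet,
    // so a proper HTTP status code can still be sent
    res_error(res, first_result->to_json());
    return;
}
// from here on, the remaining results are streamed as chunks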

ngxson (Collaborator, Author) commented Nov 11, 2025

Trying to address https://github.com/ggml-org/llama.cpp/pull/16486/files#r2419474810 in the meantime

Edit: resolved in 31b8b70


Review thread on this diff hunk:

// next responses are streamed
json first_result_json = first_result->to_json();
const auto chunked_content_provider = [first_result_json, gen, oaicompat](size_t, httplib::DataSink & sink) mutable -> bool {
ngxson (Collaborator, Author) commented Nov 11, 2025

note: in the future, when we separate the HTTP implementation from the current code base, this chunked_content_provider callback pattern will disappear.

the goal is to make each server endpoint handler itself become a generator, which generates a JSON response each time the next() function is called
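A rough sketch of what such an endpoint-level generator could look like (illustrative only; the type and method names below are hypothetical, not code from this PR):

// hypothetical interface: each next() call produces one JSON response chunk
struct server_endpoint_generator {
    virtual ~server_endpoint_generator() = default;
    // fills `out` with the next chunk; returns false once the stream is finished
    virtual bool next(nlohmann::json & out) = 0;
};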

ngxson (Collaborator, Author):

on second thought, since this chunked_content_provider lambda function is already a generator itself, we can just keep it and only change the return type.

the ultimate goal is to expose an API that allows writing code like this:

const auto handle_chat_completions = [&](const Request & req, Response & res) {
    auto body = json::parse(req.body);
    // ... do parsing stuff with body
    auto response = handle_completions_impl(...);
    if (response.stream) {
        // response is now a generator, call next() until returns false
        res.set_stream(true);
        json chunk;
        while (response.next(chunk)) {
            res.write(chunk.dump());
        }
        res.end();
    } else {
        // non-stream, response is simple object
        res.set_content(response.data);
    }
};
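To make the sketch above concrete, the value returned by handle_completions_impl(...) could be shaped roughly like this (hypothetical; completions_response and its fields are names invented to match the usage above, not an actual type in this PR):

// hypothetical return type matching the handler sketch above
struct completions_response {
    bool stream = false;               // true -> drain chunks via next()
    json data;                         // full response body for the non-stream case
    std::function<bool(json &)> next;  // fills the next chunk; returns false at end of stream
};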

ngxson (Collaborator, Author) commented Nov 12, 2025

I renamed "generator" to "reader", as the term "generator" is better used to describe the interface between server_context and the HTTP layer.

In a follow-up PR, I'll separate all http-related code into its own API. The idea is that server_context returns a response_generator to the HTTP layer, and the HTTP layer simply calls next() until there is no data left.
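As a sketch of that split, the HTTP layer's side could reduce to a plain drain loop (hypothetical code; it assumes a next(json &) method that returns false when exhausted, and uses cpp-httplib's DataSink for output):

// the HTTP layer knows nothing about tasks; it only drains the generator
static void stream_response(server_response_generator & gen, httplib::DataSink & sink) {
    json chunk;
    while (gen.next(chunk)) {
        // SSE-style framing, as used for streamed completions
        const std::string payload = "data: " + chunk.dump() + "\n\n";
        sink.write(payload.data(), payload.size());
    }
    sink.done();
}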

For now, this PR should be ready for review. No rush, but CC @ggerganov for visibility.
