-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Asynchronous aggregation jobs #3451
Comments
There are a few ways we can support asynchronous aggregation. Leader changesIn all situations, we need to change the leader such that it supports a helper behaving
I lean towards approach 1, since it can be straightforwardly accomplished. Depending on the outcome The database schema changes look invasive, so we might at least want to at least sketch out approach I believe Daphne will be implementing helper-side async, so we should be able to use it for Helper changesWe have options for how we want to implement helper-side async.
If adopting 1, we still may want to make the schema changes necessary for 2, in case we want to #3331 is available as a reference of what approach 2 looks like, but I would probably start from a Open Questions/Conclusions
We may need to speak with stakeholders to determine where they lean as well. |
LeaderAs I see it, the primary operational difference to approach #1 is that we will retransmit the request with each polling attempt, wasting bandwidth. Approach #2 does not have this (but does require more implementation effort). Are there other differences? If that is the score, I think we should strongly consider implementing #2. Either way, we should definitely sketch out the schema changes required for #2. Separately, #1 might not save as much implementation effort as we expect: currently the aggregation job driver expects to be able to pick up any unclaimed work item. Without some work to ensure appropriate backoff, we might end up polling the same aggregation job very frequently. (We'd need something similar for #2.) HelperWe don't need to implement this immediately/at all -- #1 is allowed by the spec. I think we should implement the schema changes and consider kicking the functional implmenetation down the road. |
Not quite. In both cases there shouldn't be any unnecessary retransmission of request. The key difference is just when we poll the aggregation job. At the moment, we do roughly // in step_aggregation_job_aggregate_init()
// Calculated in code previous to this.
let job: AggregationJob;
let leader_side_state; // leader-side prepared report aggregations
let resp = {
let request = AggregationJobInitializeReq::new(...);
send_request_to_helper(request, ...);
};
process_response_from_helper(resp, leader_side_state); Approach 1 prescribes that we simply do the polling inline, roughly: let backoff = ExponentialBackoff::default();
let resp = {
let request = AggregationJobInitializeReq::new(...);
let resp = send_put_request_to_helper(request, ...);
match resp.status {
AggregationJobStatus::Ready(prepare_resps) => prepare_resps,
AggregationJobStatus::Processing(uri_poll_to_at) => {
loop {
let resp = send_get_request_to_helper(uri_to_poll_at);
match resp {
AggregationJobStatus::Ready(prepare_resps) => break prepare_resps,
AggregationJobStatus::processing => backoff.wait(),
}
}
}
}
}
process_response_from_helper(resp, leader_side_state); The drawback being that the job driver is now sleeping without doing productive work between polling attempts (i.e. it's not servicing other waiting aggregation jobs). Approach 2 prescribes that we do this: // Calculated in code previous to this,
let job: AggregationJob;
let leader_side_state; // leader-side prepared report aggregations
let resp = {
let request = AggregationJobInitializeReq::new(job, ...);
let resp = send_put_request_to_helper(request, ...);
match resp.status {
AggregationJobStatus::Ready(prepare_resps) => prepare_resps,
AggregationJobStatus::Processing(uri_poll_to_at) => {
job.set_state(AggregationJobState::PollingHelper, leader_side_state);
job.poll_again_after(some_duration)
return Ok(());
}
}
}
process_response_from_helper(resp, leader_side_state); Then we have separate logic in the job acquirer that dispatches on this new state: let aggregation_jobs = get_aggregation_jobs_in_non_terminal_state();
for job in aggregation jobs {
match aggregation_jobs.state() {
AggregationJobState::InProgress => { do_whatever_we_do_normally() }
AggregationJobState::PollingHelper => {
let leader_state = load_checkpointed_leader_state(job);
poll_aggregation_job(job, leader_state)
}
}
} where let resp = {
let resp = send_get_request_to_helper(job.uri_to_poll, ...);
match resp.status {
AggregationJobStatus::Ready(prepare_resps) => prepare_resps,
AggregationJobStatus::Processing(uri_poll_to_at) => {
job.poll_again_after(some_duration);
return Ok(());
}
}
}
process_response_from_helper(resp, report_states); The drawback being more complexity and needing to checkpoint the leader-side report aggregation state to the database. But this should make better use of aggregation job driver time. |
Apologies for the misunderstanding -- I think we should definitely go with Leader approach 2. Taking up an aggregation job driver "slot" doing exponential backoff polling is very likely to negatively impact the throughput of the system. Instead, if the Helper is not ready, we should put the job back onto the queue (likely with exponential backoff over repeated attempts?) and look for different work to pick up. |
SGTM. FWIW this problem more or less already exists, because we exponentially backoff on transient retries to the PUT request. During which time the aggregation job driver is not doing productive work. It may be worth also considering what changes are needed to fix this, since we'll be overhauling this code anyway (but not a priority). |
I agree. The simplest way to address this might be to rip out that exponential-backoff polling as well, solving it the same way we plan to solve Helper polling: implementing the backoff based on placing the job back into the queue with an exponential backoff on how quickly it can be re-acquired. (Though this is off-topic from this issue, & optional.) |
DAP-12 supports handling aggregation jobs asynchronously. Determine how to implement this, and implement it.
The text was updated successfully, but these errors were encountered: